linux-kernel.vger.kernel.org archive mirror
* [RFC][PATCH 0/9] VFS: Introduce mount context
@ 2017-05-03 16:04 David Howells
  2017-05-03 16:04 ` [PATCH 1/9] Provide a function to create a NUL-terminated string from unterminated data David Howells
                   ` (14 more replies)
  0 siblings, 15 replies; 66+ messages in thread
From: David Howells @ 2017-05-03 16:04 UTC (permalink / raw)
  To: viro; +Cc: linux-fsdevel, dhowells, linux-nfs, linux-kernel, mszeredi


Here is a set of patches to create a mount context prior to setting up a
new mount, populating it with the parsed options/binary data and then
effecting the mount.

This allows namespaces and other information to be conveyed through the
mount procedure.  It also allows extra error information to be returned
(so many things can go wrong during a mount that a small integer isn't
really sufficient to convey the issue).

This also allows Miklós Szeredi's idea of doing:

	fd = fsopen("nfs");
	write(fd, "option=val", ...);
	fsmount(fd, "/mnt");

that he presented at LSF-2017 to be implemented (see the relevant patches
in the series), to which I can add:

	read(fd, error_buffer, ...);

to read back any error message.  I didn't use netlink as that would make it
depend on CONFIG_NET and would introduce network namespacing issues.

I've implemented mount context handling for procfs and nfs.

Further developments:

 (*) Implement mount context support in more filesystems, ext4 being next
     on my list.

 (*) Move the walk-from-root stuff that nfs has to generic code so that you
     can do something akin to:

	mount /dev/sda1:/foo/bar /mnt

     See nfs_follow_remote_path() and mount_subtree().  This is slightly
     tricky in NFS as we have to prevent referral loops.

 (*) Move the pid_ns pointer from struct mount_context to struct
     proc_mount_context as I'm not sure it's necessary for anything other
     than procfs.

 (*) Work out how to get at the error message incurred by submounts
     encountered during nfs_follow_remote_path().

     Should the error message be moved to task_struct and made more
     general, perhaps retrieved with a prctl() function?

 (*) Clean up/consolidate the security functions.  Possibly add a
     validation hook to be called at the same time as the mount context
     validate op.

The patches can be found here also:

	http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=mount-context

David
---
David Howells (9):
      Provide a function to create a NUL-terminated string from unterminated data
      Clean up whitespace in fs/namespace.c
      VFS: Introduce a mount context
      Implement fsopen() to prepare for a mount
      Implement fsmount() to effect a pre-configured mount
      Sample program for driving fsopen/fsmount
      procfs: Move proc_fill_super() to fs/proc/root.c
      proc: Support the mount context in procfs
      NFS: Support the mount context and fsopen()


 Documentation/filesystems/mounting.txt |  445 ++++++++
 arch/x86/entry/syscalls/syscall_32.tbl |    2 
 arch/x86/entry/syscalls/syscall_64.tbl |    2 
 fs/Makefile                            |    3 
 fs/fsopen.c                            |  295 +++++
 fs/internal.h                          |    2 
 fs/mount.h                             |    3 
 fs/mount_context.c                     |  343 ++++++
 fs/namespace.c                         |  367 ++++++-
 fs/nfs/Makefile                        |    2 
 fs/nfs/client.c                        |   18 
 fs/nfs/internal.h                      |  127 +-
 fs/nfs/mount.c                         | 1539 ++++++++++++++++++++++++++++
 fs/nfs/namespace.c                     |   75 +
 fs/nfs/nfs3_fs.h                       |    2 
 fs/nfs/nfs3client.c                    |    6 
 fs/nfs/nfs3proc.c                      |    1 
 fs/nfs/nfs4_fs.h                       |    4 
 fs/nfs/nfs4client.c                    |   80 +
 fs/nfs/nfs4namespace.c                 |  207 ++--
 fs/nfs/nfs4proc.c                      |    1 
 fs/nfs/nfs4super.c                     |  184 ++-
 fs/nfs/proc.c                          |    1 
 fs/nfs/super.c                         | 1729 ++------------------------------
 fs/proc/inode.c                        |   50 -
 fs/proc/internal.h                     |    6 
 fs/proc/root.c                         |  194 +++-
 fs/super.c                             |   50 +
 include/linux/fs.h                     |   11 
 include/linux/lsm_hooks.h              |   43 +
 include/linux/mount.h                  |   67 +
 include/linux/nfs_xdr.h                |    7 
 include/linux/security.h               |   35 +
 include/linux/string.h                 |    1 
 include/linux/syscalls.h               |    2 
 include/uapi/linux/magic.h             |    1 
 kernel/sys_ni.c                        |    4 
 mm/util.c                              |   22 
 samples/fsmount/test-fsmount.c         |   79 +
 security/security.c                    |   39 +
 security/selinux/hooks.c               |  192 ++++
 41 files changed, 4148 insertions(+), 2093 deletions(-)
 create mode 100644 Documentation/filesystems/mounting.txt
 create mode 100644 fs/fsopen.c
 create mode 100644 fs/mount_context.c
 create mode 100644 fs/nfs/mount.c
 create mode 100644 samples/fsmount/test-fsmount.c


* [PATCH 1/9] Provide a function to create a NUL-terminated string from unterminated data
  2017-05-03 16:04 [RFC][PATCH 0/9] VFS: Introduce mount context David Howells
@ 2017-05-03 16:04 ` David Howells
  2017-05-03 16:55   ` Jeff Layton
                     ` (2 more replies)
  2017-05-03 16:04 ` [PATCH 2/9] Clean up whitespace in fs/namespace.c David Howells
                   ` (13 subsequent siblings)
  14 siblings, 3 replies; 66+ messages in thread
From: David Howells @ 2017-05-03 16:04 UTC (permalink / raw)
  To: viro; +Cc: linux-fsdevel, dhowells, linux-nfs, linux-kernel, mszeredi

Provide a function, kstrcreate(), that will create a NUL-terminated string
from an unterminated character array where the length is known in advance.

This is better than kstrndup() in situations where we already know the
string length as the strnlen() in kstrndup() is superfluous.
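
As a rough userspace illustration of the difference (not kernel code),
the sketch below puts the two behaviours side by side.  str_create() and
str_ndup() are hypothetical names, and malloc() stands in for
kmalloc_track_caller(); the assumption is only that the caller knows the
length up front.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Userspace analogue of the proposed kstrcreate(): the caller already
 * knows the length, so the data is copied without scanning it for a NUL.
 * str_create() is a hypothetical name used for illustration only.
 */
static char *str_create(const char *s, size_t len)
{
	char *buf;

	if (!s)
		return NULL;

	buf = malloc(len + 1);
	if (buf) {
		memcpy(buf, s, len);
		buf[len] = '\0';
	}
	return buf;
}

/*
 * Analogue of kstrndup() for comparison: the input is first scanned for
 * a NUL within max bytes, which is redundant work when the length is
 * already known.  str_ndup() is likewise a hypothetical name.
 */
static char *str_ndup(const char *s, size_t max)
{
	const char *nul;
	size_t len;
	char *buf;

	if (!s)
		return NULL;

	nul = memchr(s, '\0', max);
	len = nul ? (size_t)(nul - s) : max;

	buf = malloc(len + 1);
	if (buf) {
		memcpy(buf, s, len);
		buf[len] = '\0';
	}
	return buf;
}
```

Note one behavioural difference in this sketch: str_create() copies
exactly len bytes even if the data contains an embedded NUL, whereas
str_ndup() stops at the first NUL it finds.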

Signed-off-by: David Howells <dhowells@redhat.com>
---

 include/linux/string.h |    1 +
 mm/util.c              |   22 ++++++++++++++++++++++
 2 files changed, 23 insertions(+)

diff --git a/include/linux/string.h b/include/linux/string.h
index 26b6f6a66f83..5596ae56ce0a 100644
--- a/include/linux/string.h
+++ b/include/linux/string.h
@@ -122,6 +122,7 @@ extern void kfree_const(const void *x);
 extern char *kstrdup(const char *s, gfp_t gfp) __malloc;
 extern const char *kstrdup_const(const char *s, gfp_t gfp);
 extern char *kstrndup(const char *s, size_t len, gfp_t gfp);
+extern char *kstrcreate(const char *s, size_t len, gfp_t gfp);
 extern void *kmemdup(const void *src, size_t len, gfp_t gfp);
 
 extern char **argv_split(gfp_t gfp, const char *str, int *argcp);
diff --git a/mm/util.c b/mm/util.c
index 656dc5e37a87..01887bbdb11e 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -103,6 +103,28 @@ char *kstrndup(const char *s, size_t max, gfp_t gfp)
 EXPORT_SYMBOL(kstrndup);
 
 /**
+ * kstrcreate - Create a NUL-terminated string from unterminated data
+ * @s: The data to stringify
+ * @len: The size of the data
+ * @gfp: the GFP mask used in the kmalloc() call when allocating memory
+ */
+char *kstrcreate(const char *s, size_t len, gfp_t gfp)
+{
+	char *buf;
+
+	if (!s)
+		return NULL;
+
+	buf = kmalloc_track_caller(len + 1, gfp);
+	if (buf) {
+		memcpy(buf, s, len);
+		buf[len] = '\0';
+	}
+	return buf;
+}
+EXPORT_SYMBOL(kstrcreate);
+
+/**
  * kmemdup - duplicate region of memory
  *
  * @src: memory region to duplicate


* [PATCH 2/9] Clean up whitespace in fs/namespace.c
  2017-05-03 16:04 [RFC][PATCH 0/9] VFS: Introduce mount context David Howells
  2017-05-03 16:04 ` [PATCH 1/9] Provide a function to create a NUL-terminated string from unterminated data David Howells
@ 2017-05-03 16:04 ` David Howells
  2017-05-03 16:04 ` [PATCH 3/9] VFS: Introduce a mount context David Howells
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 66+ messages in thread
From: David Howells @ 2017-05-03 16:04 UTC (permalink / raw)
  To: viro; +Cc: linux-fsdevel, dhowells, linux-nfs, linux-kernel, mszeredi

Clean up trailing whitespace in fs/namespace.c.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/namespace.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index cc1375eff88c..db034b6afd43 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1659,7 +1659,7 @@ void __detach_mounts(struct dentry *dentry)
 	namespace_unlock();
 }
 
-/* 
+/*
  * Is the caller allowed to modify his namespace?
  */
 static inline bool may_mount(void)
@@ -2213,7 +2213,7 @@ static int do_loopback(struct path *path, const char *old_name,
 
 	err = -EINVAL;
 	if (mnt_ns_loop(old_path.dentry))
-		goto out; 
+		goto out;
 
 	mp = lock_mount(path);
 	err = PTR_ERR(mp);


* [PATCH 3/9] VFS: Introduce a mount context
  2017-05-03 16:04 [RFC][PATCH 0/9] VFS: Introduce mount context David Howells
  2017-05-03 16:04 ` [PATCH 1/9] Provide a function to create a NUL-terminated string from unterminated data David Howells
  2017-05-03 16:04 ` [PATCH 2/9] Clean up whitespace in fs/namespace.c David Howells
@ 2017-05-03 16:04 ` David Howells
  2017-05-03 18:13   ` Jeff Layton
                     ` (4 more replies)
  2017-05-03 16:05 ` [PATCH 4/9] Implement fsopen() to prepare for a mount David Howells
                   ` (11 subsequent siblings)
  14 siblings, 5 replies; 66+ messages in thread
From: David Howells @ 2017-05-03 16:04 UTC (permalink / raw)
  To: viro; +Cc: linux-fsdevel, dhowells, linux-nfs, linux-kernel, mszeredi

Introduce a mount context concept.  This is allocated at the beginning of
the mount procedure and into it is placed:

 (1) Filesystem type.

 (2) Namespaces.

 (3) Device name.

 (4) Superblock flags (MS_*) and mount flags (MNT_*).

 (5) Security details.

 (6) Filesystem-specific data, as set by the mount options.

It also gives a place in which to hang an error message for later retrieval
(see the mount-by-fd syscall later in this series).

Rather than calling fs_type->mount(), a mount_context struct is created and
fs_type->fsopen() is called to set it up.  fs_type->mc_size says how much
should be added on to the mount context for the filesystem's use.
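
The allocation scheme can be sketched in userspace roughly as follows.
This is a simplified model rather than the code in the patch: the
structures are cut down to a couple of fields, fs_type_sketch,
demo_mount_context, demo_fsopen() and alloc_mount_context() are
hypothetical names, and mc_size is taken to mean the number of extra
bytes wanted after the core struct, as described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Cut-down stand-in for the core mount context. */
struct mount_context {
	const char	*error;
	unsigned int	ms_flags;
};

/* Cut-down stand-in for file_system_type (hypothetical name). */
struct fs_type_sketch {
	const char	*name;
	unsigned short	mc_size;	/* extra bytes after the core context */
	int (*fsopen)(struct mount_context *mc);
};

/* Hypothetical filesystem context wrapping the core struct, placed first. */
struct demo_mount_context {
	struct mount_context	mc;
	int			timeout;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Set up the filesystem-specific part of a freshly allocated context. */
static int demo_fsopen(struct mount_context *mc)
{
	struct demo_mount_context *ctx =
		container_of(mc, struct demo_mount_context, mc);

	ctx->timeout = 60;
	return 0;
}

/* Allocate the core context plus the filesystem's extra space, zeroed. */
static struct mount_context *alloc_mount_context(const struct fs_type_sketch *fs)
{
	struct mount_context *mc = calloc(1, sizeof(*mc) + fs->mc_size);

	if (!mc)
		return NULL;
	if (fs->fsopen(mc) < 0) {
		free(mc);
		return NULL;
	}
	return mc;
}

static const struct fs_type_sketch demo_fs = {
	.name		= "demo",
	.mc_size	= sizeof(struct demo_mount_context) -
			  sizeof(struct mount_context),
	.fsopen		= demo_fsopen,
};
```

Because the core struct is the first member, the pointer handed back by
alloc_mount_context() and the pointer recovered with container_of()
refer to the same address in this sketch.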

A set of operations has to be set by ->fsopen() to provide freeing,
duplication, option parsing, binary data parsing, validation, mounting and
superblock filling.
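
To give a feel for how such an operations table might be driven, here is
a minimal userspace sketch covering only the option, validate and free
operations.  It is an illustration under assumptions, not the patch's
implementation: the "source" and "timeout" options, the demo_* functions
and apply_options() are hypothetical, and the filesystem state is inlined
into the context for brevity.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

struct mount_context;

/* Reduced subset of the operations named in the description above. */
struct mount_context_operations {
	void (*free)(struct mount_context *mc);
	int (*option)(struct mount_context *mc, char *p);
	int (*validate)(struct mount_context *mc);
};

struct mount_context {
	const struct mount_context_operations *ops;
	const char	*error;		/* static string, never freed */
	char		*source;	/* hypothetical filesystem state */
	int		timeout;	/* hypothetical filesystem state */
};

/* Parse one "key[=val]" option; the string is modified in place. */
static int demo_option(struct mount_context *mc, char *p)
{
	char *val = strchr(p, '=');

	if (val)
		*val++ = '\0';

	if (strcmp(p, "source") == 0 && val) {
		char *copy = malloc(strlen(val) + 1);

		if (!copy)
			return -ENOMEM;
		strcpy(copy, val);
		free(mc->source);
		mc->source = copy;
		return 0;
	}
	if (strcmp(p, "timeout") == 0 && val) {
		mc->timeout = atoi(val);
		return 0;
	}

	mc->error = "Unknown or malformed option";
	return -EINVAL;
}

/* Check the accumulated options for consistency before mounting. */
static int demo_validate(struct mount_context *mc)
{
	if (!mc->source) {
		mc->error = "No source specified";
		return -EINVAL;
	}
	return 0;
}

/* Release anything the filesystem attached to the context. */
static void demo_free(struct mount_context *mc)
{
	free(mc->source);
	mc->source = NULL;
}

static const struct mount_context_operations demo_ops = {
	.free		= demo_free,
	.option		= demo_option,
	.validate	= demo_validate,
};

/*
 * Feed a comma-separated option list through ->option() one item at a
 * time, then call ->validate().  The buffer is modified in place.
 */
static int apply_options(struct mount_context *mc, char *data)
{
	char *p = data;

	while (p && *p) {
		char *next = strchr(p, ',');
		int ret;

		if (next)
			*next++ = '\0';
		if (*p) {
			ret = mc->ops->option(mc, p);
			if (ret < 0)
				return ret;
		}
		p = next;
	}
	return mc->ops->validate(mc);
}
```

In this sketch an ambiguous -EINVAL is always accompanied by a static
string in mc->error, which is the pattern the error-reporting scheme
relies on.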

It should be noted that, whilst this patch adds a lot of lines of code,
there is quite a bit of duplication with existing code that can be
eliminated should all filesystems be converted over.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 Documentation/filesystems/mounting.txt |  445 ++++++++++++++++++++++++++++++++
 fs/Makefile                            |    3 
 fs/internal.h                          |    2 
 fs/mount.h                             |    3 
 fs/mount_context.c                     |  343 +++++++++++++++++++++++++
 fs/namespace.c                         |  270 +++++++++++++++++--
 fs/super.c                             |   50 +++-
 include/linux/fs.h                     |   11 +
 include/linux/lsm_hooks.h              |   37 +++
 include/linux/mount.h                  |   67 +++++
 include/linux/security.h               |   29 ++
 security/security.c                    |   32 ++
 security/selinux/hooks.c               |  179 +++++++++++++
 13 files changed, 1435 insertions(+), 36 deletions(-)
 create mode 100644 Documentation/filesystems/mounting.txt
 create mode 100644 fs/mount_context.c

diff --git a/Documentation/filesystems/mounting.txt b/Documentation/filesystems/mounting.txt
new file mode 100644
index 000000000000..a942ccd08376
--- /dev/null
+++ b/Documentation/filesystems/mounting.txt
@@ -0,0 +1,445 @@
+			      ===================
+			      FILESYSTEM MOUNTING
+			      ===================
+
+CONTENTS
+
+ (1) Overview.
+
+ (2) The mount context.
+
+ (3) The mount context operations.
+
+ (4) Mount context security.
+
+ (5) VFS mount context operations.
+
+
+========
+OVERVIEW
+========
+
+The creation of new mounts is now to be done in a multistep process:
+
+ (1) Create a mount context.
+
+ (2) Parse the options and attach them to the mount context.  Options may be
+     passed individually from userspace.
+
+ (3) Validate and pre-process the mount context.
+
+ (4) Perform the mount.
+
+ (5) Return an error message attached to the mount context.
+
+ (6) Destroy the mount context.
+
+To support this, the file_system_type struct gains two new fields:
+
+	unsigned short mc_size;
+
+which indicates how much space the filesystem would like tacked onto the end of
+the mount_context struct for its own purposes, and:
+
+	int (*fsopen)(struct mount_context *mc, struct super_block *src_sb);
+
+which is invoked to set up the filesystem-specific parts of a mount context,
+including the additional space.  The src_sb parameter is used to convey the
+superblock from which the filesystem may draw extra information (such as
+namespaces) for submount (MS_SUBMOUNT) or remount (MS_REMOUNT) purposes;
+otherwise it will be NULL.
+
+Note that security initialisation is done *after* the filesystem is called so
+that the namespaces may be adjusted first.
+
+And the super_operations struct gains one:
+
+	int (*remount_fs_mc) (struct super_block *, struct mount_context *);
+
+This shadows the ->remount_fs() operation and takes a prepared mount context
+instead of the mount flags and data page.  It may modify the ms_flags in the
+context for the caller to pick up.
+
+[NOTE] remount_fs_mc is intended as a replacement for remount_fs.
+
+
+=================
+THE MOUNT CONTEXT
+=================
+
+The mount process is governed by a mount context.  This is represented by the
+mount_context structure:
+
+	struct mount_context {
+		const struct mount_context_operations *ops;
+		struct file_system_type *fs;
+		struct user_namespace	*user_ns;
+		struct mnt_namespace	*mnt_ns;
+		struct pid_namespace	*pid_ns;
+		struct net		*net_ns;
+		const struct cred	*cred;
+		char			*device;
+		char			*root_path;
+		void			*security;
+		const char		*error;
+		unsigned int		ms_flags;
+		unsigned int		mnt_flags;
+		bool			mounted;
+		bool			sloppy;
+		bool			silent;
+		enum mount_type		mount_type : 8;
+	};
+
+When allocated, the mount_context struct is extended by ->mc_size bytes, as
+specified by the relevant file_system_type struct.  This is for use by the
+filesystem.  The filesystem should wrap the struct in its own, e.g.:
+
+	struct nfs_mount_context {
+		struct mount_context mc;
+		...
+	};
+
+placing the mount_context struct first.  container_of() can then be used.
+
+The mount_context fields are as follows:
+
+ (*) const struct mount_context_operations *ops
+
+     These are operations that can be done on a mount context.  See below.
+     This must be set by the ->fsopen() file_system_type operation.
+
+ (*) struct file_system_type *fs
+
+     A pointer to the file_system_type of the filesystem that is being
+     mounted.  This retains a ref on the type owner.
+
+ (*) struct user_namespace *user_ns
+ (*) struct mnt_namespace *mnt_ns
+ (*) struct pid_namespace *pid_ns
+ (*) struct net *net_ns
+
+     This is a subset of the namespaces in use by the invoking process.  This
+     retains a ref on each namespace.  The subscribed namespaces may be
+     replaced by the filesystem to reflect other sources, such as the parent
+     mount superblock on an automount.
+
+ (*) struct cred *cred
+
+     The mounter's credentials.  This retains a ref on the credentials.
+
+ (*) char *device
+
+     This is the device to be mounted.  It may be a block device
+     (e.g. /dev/sda1) or something more exotic, such as the "host:/path" that
+     NFS desires.
+
+ (*) char *root_path
+
+     A path to the place inside the filesystem to actually mount.  This allows
+     a mount and bind-mount to be combined.
+
+     [NOTE] This isn't implemented yet, but NFS has the code to do this which
+     could be moved to the VFS.
+
+ (*) void *security
+
+     A place for the LSMs to hang their security data for the mount.  The
+     relevant security operations are described below.
+
+ (*) const char *error
+
+     A place for the VFS and the filesystem to hang an error message.  This
+     should be in the form of a static string that doesn't need deallocation
+     and the pointer to which can just be overwritten.  Under some
+     circumstances, this can be retrieved by userspace.
+
+     Note that the existence of the error string is expected to be guaranteed
+     by the reference on the file_system_type object held by ->fs or any
+     filesystem-specific reference held in the filesystem context until the
+     ->free() operation is called.
+
+ (*) unsigned int ms_flags
+ (*) unsigned int mnt_flags
+
+     These hold the mount flags.  ms_flags holds MS_* flags and mnt_flags holds
+     MNT_* flags.
+
+ (*) bool mounted
+
+     This is set to true once a mount attempt is made.  This causes an error to
+     be given on subsequent mount attempts with the same context and prevents
+     multiple mount attempts.
+
+ (*) bool sloppy
+ (*) bool silent
+
+     These are set if the sloppy or silent mount options are given.
+
+     [NOTE] sloppy is probably unnecessary when userspace passes over one
+     option at a time since the error can just be ignored if userspace deems it
+     to be unimportant.
+
+     [NOTE] silent is probably redundant with ms_flags & MS_SILENT.
+
+ (*) enum mount_type
+
+     This indicates the type of mount operation.  The available values are:
+
+	MOUNT_TYPE_NEW		-- New mount
+	MOUNT_TYPE_SUBMOUNT	-- New automatic submount of extant mount
+	MOUNT_TYPE_REMOUNT	-- Change an existing mount
+
+The mount context is created by calling __vfs_fsopen(), vfs_fsopen(),
+vfs_mntopen() or vfs_dup_mount_context() and is destroyed with
+put_mount_context().  Note that the structure is not refcounted.
+
+VFS, security and filesystem mount options are set individually with
+vfs_mount_option() or in bulk with generic_monolithic_mount_data().
+
+When mounting, the filesystem is allowed to take data from any of the pointers
+and attach it to the superblock (or whatever), provided it clears the pointer
+in the mount context.
+
+The filesystem is also allowed to allocate resources and pin them with the
+mount context.  For instance, NFS might pin the appropriate protocol version
+module.
+
+
+============================
+THE MOUNT CONTEXT OPERATIONS
+============================
+
+The mount context points to a table of operations:
+
+	struct mount_context_operations {
+		void (*free)(struct mount_context *mc);
+		int (*dup)(struct mount_context *mc, struct mount_context *src);
+		int (*option)(struct mount_context *mc, char *p);
+		int (*monolithic_mount_data)(struct mount_context *mc, void *data);
+		int (*validate)(struct mount_context *mc);
+		struct dentry *(*mount)(struct mount_context *mc);
+		int (*fill_super)(struct super_block *s, struct mount_context *mc);
+	};
+
+These operations are invoked by the various stages of the mount procedure to
+manage the mount context.  They are as follows:
+
+ (*) void (*free)(struct mount_context *mc);
+
+     Called to clean up the filesystem-specific part of the mount context when
+     the context is destroyed.  It should be aware that parts of the context
+     may have been removed and NULL'd out by ->mount().
+
+ (*) int (*dup)(struct mount_context *mc, struct mount_context *src);
+
+     Called when a mount context has been duplicated to get any refs or copy
+     any non-referenced resources held in the filesystem-specific part of the
+     mount context.  An error may be returned to indicate failure to do this.
+
+     [!] Note that if this fails, put_mount_context() will be called
+     	 immediately thereafter, so ->dup() *must* make the
+     	 filesystem-specific part safe for ->free().
+
+ (*) int (*option)(struct mount_context *mc, char *p);
+
+     Called when an option is to be added to the mount context.  p points to
+     the option string, likely in "key[=val]" format.  VFS-specific options
+     will have been weeded out and mc->ms_flags and mc->mnt_flags updated in
+     the context.  Security options will also have been weeded out and
+     mc->security updated.
+
+     If successful, 0 should be returned and a negative error code otherwise.
+     If an ambiguous error (such as -EINVAL) is returned, mc->error should be
+     set in the context to a string that provides more information.
+
+ (*) int (*monolithic_mount_data)(struct mount_context *mc, void *data);
+
+     Called when the mount(2) system call is invoked to pass the entire data
+     page in one go.  If this is expected to be just a list of "key[=val]"
+     items separated by commas, then this may be set to NULL.
+
+     The return value is as for ->option().
+
+     If the filesystem (eg. NFS) needs to examine the data first and then
+     finds it's the standard key-val list then it may pass it off to:
+
+	int generic_monolithic_mount_data(struct mount_context *mc, void *data);
+
+ (*) int (*validate)(struct mount_context *mc);
+
+     Called when all the options have been applied and the mount is about to
+     take place.  It should check for inconsistencies from mount options
+     and it is also allowed to do preliminary resource acquisition.  For
+     instance, the core NFS module could load the NFS protocol module here.
+
+     Note that if mc->mount_type == MOUNT_TYPE_REMOUNT, some of the options
+     necessary for a new mount may not be set.
+
+     The return value is as for ->option().
+
+ (*) struct dentry *(*mount)(struct mount_context *mc);
+
+     Called to effect a new mount or new submount using the information stored
+     in the mount context (remounts go via a different vector).  It may detach
+     any resources it desires from the mount context and transfer them to the
+     superblock it creates.
+
+     On success it should return the dentry that's at the root of the mount.
+     In future, mc->root_path will then be applied to this.
+
+     In the case of an error, it should return a negative error code and set
+     mc->error.
+
+ (*) int (*fill_super)(struct super_block *s, struct mount_context *mc);
+
+     This is available to be used by things like mount_ns_mc() that are called
+     by ->mount() to transfer information/resources from the mount context to
+     the superblock.
+
+
+======================
+MOUNT CONTEXT SECURITY
+======================
+
+The mount context contains a security pointer that the LSMs can use for
+building up a security context for the superblock to be mounted.  There are a
+number of operations used by the new mount code for this purpose:
+
+ (*) int security_mount_ctx_alloc(struct mount_context *mc,
+				  struct super_block *src_sb);
+
+     Called to initialise mc->security (which is preset to NULL) and allocate
+     any resources needed.  It should return 0 on success and a negative error
+     code on failure.
+
+     src_sb is non-NULL in the case of a remount (MS_REMOUNT) in which case it
+     indicates the superblock to be remounted or in the case of a submount
+     (MS_SUBMOUNT) in which case it indicates the parent superblock.
+
+ (*) int security_mount_ctx_dup(struct mount_context *mc,
+				struct mount_context *src_mc);
+
+     Called to initialise mc->security (which is preset to NULL) and allocate
+     any resources needed.  The original mount context is pointed to by src_mc
+     and may be used for reference.  It should return 0 on success and a
+     negative error code on failure.
+
+ (*) void security_mount_ctx_free(struct mount_context *mc);
+
+     Called to clean up anything attached to mc->security.  Note that the
+     contents may have been transferred to a superblock and the pointer NULL'd
+     out during mount.
+
+ (*) int security_mount_ctx_option(struct mount_context *mc, char *opt);
+
+     Called for each mount option.  The mount options are in "key[=val]"
+     form.  An active LSM may reject one with an error, pass one over and
+     return 0 or consume one and return 1.  If consumed, the option isn't
+     passed on to the filesystem.
+
+     If it returns an error, it should set mc->error if the error is
+     ambiguous.
+
+ (*) int security_mount_ctx_kern_mount(struct mount_context *mc,
+				       struct super_block *sb);
+
+     Called during mount to verify that the specified superblock is allowed to
+     be mounted and to transfer the security data there.
+
+     On success, it should return 0; otherwise it should return an error and
+     set mc->error to indicate the problem.  It should not return -ENOMEM as
+     this should be taken care of in advance.
+
+     [NOTE] Should I add a security_mount_ctx_validate() operation so that the
+     LSM has the opportunity to allocate stuff and check the options as a
+     whole?
+
+
+============================
+VFS MOUNT CONTEXT OPERATIONS
+============================
+
+There are four operations for creating a mount context and one for destroying
+a context:
+
+ (*) struct mount_context *__vfs_fsopen(struct file_system_type *fs_type,
+					struct super_block *src_sb,
+					unsigned int ms_flags,
+					unsigned int mnt_flags);
+
+     Create a mount context given a filesystem type pointer.  This allocates
+     the mount context, sets the flags, initialises the security and calls
+     fs_type->fsopen() to initialise the filesystem context.
+
+     src_sb can be NULL or it may indicate a superblock that is going to be
+     remounted (MS_REMOUNT) or a superblock that is the parent of a submount
+     (MS_SUBMOUNT).  This superblock is provided as a source of namespace
+     information.
+
+ (*) struct mount_context *vfs_mntopen(struct vfsmount *mnt,
+				       unsigned int ms_flags,
+				       unsigned int mnt_flags);
+
+     Create a mount context from the same filesystem as an extant mount and
+     initialise the mount parameters from the superblock underlying that
+     mount.  This is used by remount.
+
+ (*) struct mount_context *vfs_fsopen(const char *fs_name);
+
+     Create a mount context given a filesystem name.  It is assumed that the
+     mount flags will be passed in as text options later.  This is intended to
+     be called from sys_fsopen().  This copies current's namespaces to the
+     mount context.
+
+ (*) struct mount_context *vfs_dup_mount_context(struct mount_context *src);
+
+     Duplicate a mount context, copying any options noted and duplicating or
+     additionally referencing any resources held therein.  This is available
+     for use where a filesystem has to get a mount within a mount, such as
+     NFS4 does by internally mounting the root of the target server and then
+     doing a private pathwalk to the target directory.
+
+ (*) void put_mount_context(struct mount_context *ctx);
+
+     Destroy a mount context, releasing any resources it holds.  This calls
+     the ->free() operation.  This is intended to be called by anyone who
+     created a mount context.
+
+     [!] Mount contexts are not refcounted, so this causes unconditional
+     	 destruction.
+
+In all the above operations, apart from the put op, the return is a mount
+context pointer or a negative error code.  No error string is saved as the
+error string is only guaranteed as long as the file_system_type is pinned (and
+thus the module).
+
+In the remaining operations, if an error occurs, a negative error code is
+returned and, if not obvious, mc->error should be set to point to a useful
+string.  The string should not be freed.
+
+ (*) struct vfsmount *vfs_kern_mount_mc(struct mount_context *mc);
+
+     Create a mount given the parameters in the specified mount context.  This
+     invokes the ->validate() op and then the ->mount() op.
+
+ (*) struct vfsmount *vfs_submount_mc(const struct dentry *mountpoint,
+				      struct mount_context *mc);
+
+     Create a mount given a mount context and set MS_SUBMOUNT on it.  A
+     wrapper around vfs_kern_mount_mc().  This is intended to be called from
+     filesystems that have automount points (NFS, AFS, ...).
+
+ (*) int vfs_mount_option(struct mount_context *mc, char *data);
+
+     Supply a single mount option to the mount context.  The mount option
+     should likely be in a "key[=val]" string form.  The option is first
+     checked to see if it corresponds to a standard mount flag (in which case
+     it is used to mark an MS_xxx flag and consumed) or a security option (in
+     which case the LSM consumes it) before it is passed on to the filesystem.
+
+ (*) int generic_monolithic_mount_data(struct mount_context *ctx, void *data);
+
+     Parse a sys_mount() data page, assuming the form to be a text list
+     consisting of key[=val] options separated by commas.  Each item in the
+     list is passed to vfs_mount_option().  This is the default when the
+     ->monolithic_mount_data() operation is NULL.
diff --git a/fs/Makefile b/fs/Makefile
index 7bbaca9c67b1..308a104a9a07 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -11,7 +11,8 @@ obj-y :=	open.o read_write.o file_table.o super.o \
 		attr.o bad_inode.o file.o filesystems.o namespace.o \
 		seq_file.o xattr.o libfs.o fs-writeback.o \
 		pnode.o splice.o sync.o utimes.o \
-		stack.o fs_struct.o statfs.o fs_pin.o nsfs.o
+		stack.o fs_struct.o statfs.o fs_pin.o nsfs.o \
+		mount_context.o
 
 ifeq ($(CONFIG_BLOCK),y)
 obj-y +=	buffer.o block_dev.o direct-io.o mpage.o
diff --git a/fs/internal.h b/fs/internal.h
index 076751d90ba2..ef8c5e93f364 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -87,7 +87,7 @@ extern struct file *get_empty_filp(void);
 /*
  * super.c
  */
-extern int do_remount_sb(struct super_block *, int, void *, int);
+extern int do_remount_sb(struct super_block *, int, void *, int, struct mount_context *);
 extern bool trylock_super(struct super_block *sb);
 extern struct dentry *mount_fs(struct file_system_type *,
 			       int, const char *, void *);
diff --git a/fs/mount.h b/fs/mount.h
index 2826543a131d..b1e99b38f2ee 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -108,9 +108,10 @@ static inline void detach_mounts(struct dentry *dentry)
 	__detach_mounts(dentry);
 }
 
-static inline void get_mnt_ns(struct mnt_namespace *ns)
+static inline struct mnt_namespace *get_mnt_ns(struct mnt_namespace *ns)
 {
 	atomic_inc(&ns->count);
+	return ns;
 }
 
 extern seqlock_t mount_lock;
diff --git a/fs/mount_context.c b/fs/mount_context.c
new file mode 100644
index 000000000000..7d765c100bf1
--- /dev/null
+++ b/fs/mount_context.c
@@ -0,0 +1,343 @@
+/* Provide a way to create a mount context within the kernel that can be
+ * configured before mounting.
+ *
+ * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/fs.h>
+#include <linux/mount.h>
+#include <linux/nsproxy.h>
+#include <linux/slab.h>
+#include <linux/magic.h>
+#include <linux/security.h>
+#include <linux/parser.h>
+#include <linux/mnt_namespace.h>
+#include <linux/pid_namespace.h>
+#include <linux/user_namespace.h>
+#include <net/net_namespace.h>
+#include "mount.h"
+
+static const match_table_t common_set_mount_options = {
+	{ MS_DIRSYNC,		"dirsync" },
+	{ MS_I_VERSION,		"iversion" },
+	{ MS_LAZYTIME,		"lazytime" },
+	{ MS_MANDLOCK,		"mand" },
+	{ MS_NOATIME,		"noatime" },
+	{ MS_NODEV,		"nodev" },
+	{ MS_NODIRATIME,	"nodiratime" },
+	{ MS_NOEXEC,		"noexec" },
+	{ MS_NOSUID,		"nosuid" },
+	{ MS_POSIXACL,		"posixacl" },
+	{ MS_RDONLY,		"ro" },
+	{ MS_REC,		"rec" },
+	{ MS_RELATIME,		"relatime" },
+	{ MS_STRICTATIME,	"strictatime" },
+	{ MS_SYNCHRONOUS,	"sync" },
+	{ MS_VERBOSE,		"verbose" },
+	{ },
+};
+
+static const match_table_t common_clear_mount_options = {
+	{ MS_LAZYTIME,		"nolazytime" },
+	{ MS_MANDLOCK,		"nomand" },
+	{ MS_NODEV,		"dev" },
+	{ MS_NOEXEC,		"exec" },
+	{ MS_NOSUID,		"suid" },
+	{ MS_RDONLY,		"rw" },
+	{ MS_RELATIME,		"norelatime" },
+	{ MS_SILENT,		"silent" },
+	{ MS_STRICTATIME,	"nostrictatime" },
+	{ MS_SYNCHRONOUS,	"async" },
+	{ },
+};
+
+static const match_table_t forbidden_mount_options = {
+	{ MS_BIND,		"bind" },
+	{ MS_KERNMOUNT,		"ro" },
+	{ MS_MOVE,		"move" },
+	{ MS_PRIVATE,		"private" },
+	{ MS_REMOUNT,		"remount" },
+	{ MS_SHARED,		"shared" },
+	{ MS_SLAVE,		"slave" },
+	{ MS_UNBINDABLE,	"unbindable" },
+	{ },
+};
+
+/*
+ * Check for a common mount option.
+ */
+static noinline int vfs_common_mount_option(struct mount_context *mc, char *data)
+{
+	substring_t args[MAX_OPT_ARGS];
+	unsigned int token;
+
+	token = match_token(data, common_set_mount_options, args);
+	if (token) {
+		mc->ms_flags |= token;
+		return 1;
+	}
+
+	token = match_token(data, common_clear_mount_options, args);
+	if (token) {
+		mc->ms_flags &= ~token;
+		return 1;
+	}
+
+	token = match_token(data, forbidden_mount_options, args);
+	if (token) {
+		mc->error = "Mount option, not superblock option";
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+/**
+ * vfs_mount_option - Add a single mount option to a mount context
+ * @mc: The mount context to modify
+ * @data: The option to apply.
+ *
+ * A single mount option in string form is applied to the mount being set up in
+ * the mount context.  Certain standard options (for example "ro") are
+ * translated into flag bits without going to the filesystem.  The active
+ * security module is allowed to observe and poach options.  Any other
+ * options are passed over to the filesystem to parse.
+ *
+ * This may be called multiple times for a context.
+ *
+ * Returns 0 on success and a negative error code on failure.  In the event of
+ * failure, mc->error may have been set to a non-allocated string that gives
+ * more information.
+ */
+int vfs_mount_option(struct mount_context *mc, char *data)
+{
+	int ret;
+
+	if (mc->mounted)
+		return -EBUSY;
+
+	ret = vfs_common_mount_option(mc, data);
+	if (ret < 0)
+		return ret;
+	if (ret == 1)
+		return 0;
+
+	ret = security_mount_ctx_option(mc, data);
+	if (ret < 0)
+		return ret;
+	if (ret == 1)
+		return 0;
+
+	return mc->ops->option(mc, data);
+}
+EXPORT_SYMBOL(vfs_mount_option);
+
+/**
+ * generic_monolithic_mount_data - Parse key[=val][,key[=val]]* mount data
+ * @ctx: The mount context to populate
+ * @data: The data to parse
+ *
+ * Parse a blob of data that's in key[=val][,key[=val]]* form.  This can be
+ * called from the ->monolithic_mount_data() mount context operation.
+ *
+ * Returns 0 on success or the error returned by the ->option() mount context
+ * operation on failure.
+ */
+int generic_monolithic_mount_data(struct mount_context *ctx, void *data)
+{
+	char *options = data, *p;
+	int ret;
+
+	if (!options)
+		return 0;
+
+	while ((p = strsep(&options, ",")) != NULL) {
+		if (*p) {
+			ret = vfs_mount_option(ctx, p);
+			if (ret < 0)
+				return ret;
+		}
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL(generic_monolithic_mount_data);
+
+/**
+ * __vfs_fsopen - Open a filesystem and create a mount context
+ * @fs_type: The filesystem type
+ * @src_sb: A superblock from which this one derives (or NULL)
+ * @ms_flags: Superblock flags and op flags (such as MS_REMOUNT)
+ * @mnt_flags: Mountpoint flags, such as MNT_READONLY
+ * @mount_type: Type of mount
+ *
+ * Open a filesystem and create a mount context.  The mount context is
+ * initialised with the supplied flags and, if a submount/automount from
+ * another superblock (@src_sb), may have parameters such as namespaces copied
+ * across from that superblock.
+ */
+struct mount_context *__vfs_fsopen(struct file_system_type *fs_type,
+				   struct super_block *src_sb,
+				   unsigned int ms_flags, unsigned int mnt_flags,
+				   enum mount_type mount_type)
+{
+	struct mount_context *mc;
+	int ret;
+
+	if (fs_type->fsopen && fs_type->mc_size < sizeof(*mc))
+		BUG();
+
+	mc = kzalloc(max_t(size_t, fs_type->mc_size, sizeof(*mc)), GFP_KERNEL);
+	if (!mc)
+		return ERR_PTR(-ENOMEM);
+
+	mc->mount_type = mount_type;
+	mc->ms_flags = ms_flags;
+	mc->mnt_flags = mnt_flags;
+	mc->fs_type = fs_type;
+	get_filesystem(fs_type);
+	mc->mnt_ns = get_mnt_ns(current->nsproxy->mnt_ns);
+	mc->pid_ns = get_pid_ns(task_active_pid_ns(current));
+	mc->net_ns = get_net(current->nsproxy->net_ns);
+	mc->user_ns = get_user_ns(current_user_ns());
+	mc->cred = get_current_cred();
+
+
+	/* TODO: Make all filesystems support this unconditionally */
+	if (mc->fs_type->fsopen) {
+		ret = mc->fs_type->fsopen(mc, src_sb);
+		if (ret < 0)
+			goto err_mc;
+	}
+
+	/* Do the security check last because ->fsopen may change the
+	 * namespace subscriptions.
+	 */
+	ret = security_mount_ctx_alloc(mc, src_sb);
+	if (ret < 0)
+		goto err_mc;
+
+	return mc;
+
+err_mc:
+	put_mount_context(mc);
+	return ERR_PTR(ret);
+}
+EXPORT_SYMBOL(__vfs_fsopen);
+
+/**
+ * vfs_fsopen - Open a filesystem and create a mount context
+ * @fs_name: The name of the filesystem
+ *
+ * Open a filesystem and create a mount context that will hold the mount
+ * options, device name, security details, etc.  Note that the caller should
+ * check the ->ops pointer in the returned context to determine whether the
+ * filesystem actually supports the mount context itself.
+ */
+struct mount_context *vfs_fsopen(const char *fs_name)
+{
+	struct file_system_type *fs_type;
+	struct mount_context *mc;
+
+	fs_type = get_fs_type(fs_name);
+	if (!fs_type)
+		return ERR_PTR(-ENODEV);
+
+	mc = __vfs_fsopen(fs_type, NULL, 0, 0, MOUNT_TYPE_NEW);
+	put_filesystem(fs_type);
+	return mc;
+}
+EXPORT_SYMBOL(vfs_fsopen);
+
+/**
+ * vfs_mntopen - Create a mount context and initialise it from an extant mount
+ * @mnt: The mountpoint to open
+ * @ms_flags: Superblock flags and op flags (such as MS_REMOUNT)
+ * @mnt_flags: Mountpoint flags, such as MNT_READONLY
+ * @mount_type: Type of mount
+ *
+ * Open a mounted filesystem and create a mount context such that a remount can
+ * be effected.
+ */
+struct mount_context *vfs_mntopen(struct vfsmount *mnt,
+				  unsigned int ms_flags,
+				  unsigned int mnt_flags,
+				  enum mount_type mount_type)
+{
+	return __vfs_fsopen(mnt->mnt_sb->s_type, mnt->mnt_sb,
+			    ms_flags, mnt_flags, mount_type);
+}
+
+/**
+ * vfs_dup_mount_context - Duplicate a mount context.
+ * @src: The mount context to copy.
+ */
+struct mount_context *vfs_dup_mount_context(struct mount_context *src)
+{
+	struct mount_context *mc;
+	int ret;
+
+	if (!src->ops->dup)
+		return ERR_PTR(-ENOTSUPP);
+
+	mc = kmemdup(src, src->fs_type->mc_size, GFP_KERNEL);
+	if (!mc)
+		return ERR_PTR(-ENOMEM);
+
+	mc->device	= NULL;
+	mc->root_path	= NULL;
+	mc->security	= NULL;
+	mc->error	= NULL;
+	get_filesystem(mc->fs_type);
+	get_mnt_ns(mc->mnt_ns);
+	get_pid_ns(mc->pid_ns);
+	get_net(mc->net_ns);
+	get_user_ns(mc->user_ns);
+	get_cred(mc->cred);
+
+	/* Can't call put until we've called ->dup */
+	ret = mc->ops->dup(mc, src);
+	if (ret < 0)
+		goto err_mc;
+
+	ret = security_mount_ctx_dup(mc, src);
+	if (ret < 0)
+		goto err_mc;
+	return mc;
+
+err_mc:
+	put_mount_context(mc);
+	return ERR_PTR(ret);
+}
+EXPORT_SYMBOL(vfs_dup_mount_context);
+
+/*
+ * Dispose of a mount context.
+ */
+void put_mount_context(struct mount_context *mc)
+{
+	if (mc->ops && mc->ops->free)
+		mc->ops->free(mc);
+	security_mount_ctx_free(mc);
+	if (mc->mnt_ns)
+		put_mnt_ns(mc->mnt_ns);
+	if (mc->pid_ns)
+		put_pid_ns(mc->pid_ns);
+	if (mc->net_ns)
+		put_net(mc->net_ns);
+	put_user_ns(mc->user_ns);
+	if (mc->cred)
+		put_cred(mc->cred);
+	put_filesystem(mc->fs_type);
+	kfree(mc->device);
+	kfree(mc->root_path);
+	kfree(mc);
+}
+EXPORT_SYMBOL(put_mount_context);
diff --git a/fs/namespace.c b/fs/namespace.c
index db034b6afd43..e0edab9af308 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -25,6 +25,7 @@
 #include <linux/magic.h>
 #include <linux/bootmem.h>
 #include <linux/task_work.h>
+#include <linux/file.h>
 #include <linux/sched/task.h>
 
 #include "pnode.h"
@@ -783,9 +784,14 @@ static void put_mountpoint(struct mountpoint *mp)
 	}
 }
 
+static inline int __check_mnt(struct mount *mnt, struct mnt_namespace *mnt_ns)
+{
+	return mnt->mnt_ns == mnt_ns;
+}
+
 static inline int check_mnt(struct mount *mnt)
 {
-	return mnt->mnt_ns == current->nsproxy->mnt_ns;
+	return __check_mnt(mnt, current->nsproxy->mnt_ns);
 }
 
 /*
@@ -1596,7 +1602,7 @@ static int do_umount(struct mount *mnt, int flags)
 			return -EPERM;
 		down_write(&sb->s_umount);
 		if (!(sb->s_flags & MS_RDONLY))
-			retval = do_remount_sb(sb, MS_RDONLY, NULL, 0);
+			retval = do_remount_sb(sb, MS_RDONLY, NULL, 0, NULL);
 		up_write(&sb->s_umount);
 		return retval;
 	}
@@ -2279,6 +2285,26 @@ static int change_mount_flags(struct vfsmount *mnt, int ms_flags)
 }
 
 /*
+ * Parse the monolithic page of mount data given to sys_mount().
+ */
+static int parse_monolithic_mount_data(struct mount_context *mc, void *data)
+{
+	int (*monolithic_mount_data)(struct mount_context *, void *);
+	int ret;
+
+	monolithic_mount_data = mc->ops->monolithic_mount_data;
+	if (!monolithic_mount_data)
+		monolithic_mount_data = generic_monolithic_mount_data;
+
+	ret = monolithic_mount_data(mc, data);
+	if (ret < 0)
+		return ret;
+	if (mc->ops->validate)
+		return mc->ops->validate(mc);
+	return 0;
+}
+
+/*
  * change filesystem flags. dir should be a physical root of filesystem.
  * If you've mounted a non-root directory somewhere and want to do remount
  * on it - tough luck.
@@ -2286,13 +2312,14 @@ static int change_mount_flags(struct vfsmount *mnt, int ms_flags)
 static int do_remount(struct path *path, int flags, int mnt_flags,
 		      void *data)
 {
+	struct mount_context *mc = NULL;
 	int err;
 	struct super_block *sb = path->mnt->mnt_sb;
 	struct mount *mnt = real_mount(path->mnt);
+	struct file_system_type *type = sb->s_type;
 
 	if (!check_mnt(mnt))
 		return -EINVAL;
-
 	if (path->dentry != path->mnt->mnt_root)
 		return -EINVAL;
 
@@ -2323,9 +2350,19 @@ static int do_remount(struct path *path, int flags, int mnt_flags,
 		return -EPERM;
 	}
 
-	err = security_sb_remount(sb, data);
-	if (err)
-		return err;
+	if (type->fsopen) {
+		mc = vfs_mntopen(path->mnt, flags, mnt_flags, MOUNT_TYPE_REMOUNT);
+		if (IS_ERR(mc))
+			return PTR_ERR(mc);
+
+		err = parse_monolithic_mount_data(mc, data);
+		if (err < 0)
+			goto err_mc;
+	} else {
+		err = security_sb_remount(sb, data);
+		if (err)
+			return err;
+	}
 
 	down_write(&sb->s_umount);
 	if (flags & MS_BIND)
@@ -2333,7 +2370,7 @@ static int do_remount(struct path *path, int flags, int mnt_flags,
 	else if (!capable(CAP_SYS_ADMIN))
 		err = -EPERM;
 	else
-		err = do_remount_sb(sb, flags, data, 0);
+		err = do_remount_sb(sb, flags, data, 0, mc);
 	if (!err) {
 		lock_mount_hash();
 		mnt_flags |= mnt->mnt.mnt_flags & ~MNT_USER_SETTABLE_MASK;
@@ -2342,6 +2379,9 @@ static int do_remount(struct path *path, int flags, int mnt_flags,
 		unlock_mount_hash();
 	}
 	up_write(&sb->s_umount);
+err_mc:
+	if (mc)
+		put_mount_context(mc);
 	return err;
 }
 
@@ -2451,7 +2491,8 @@ static struct vfsmount *fs_set_subtype(struct vfsmount *mnt, const char *fstype)
 /*
  * add a mount into a namespace's mount tree
  */
-static int do_add_mount(struct mount *newmnt, struct path *path, int mnt_flags)
+static int do_add_mount(struct mount *newmnt, struct path *path, int mnt_flags,
+			struct mnt_namespace *mnt_ns)
 {
 	struct mountpoint *mp;
 	struct mount *parent;
@@ -2465,7 +2506,7 @@ static int do_add_mount(struct mount *newmnt, struct path *path, int mnt_flags)
 
 	parent = real_mount(path->mnt);
 	err = -EINVAL;
-	if (unlikely(!check_mnt(parent))) {
+	if (unlikely(!__check_mnt(parent, mnt_ns))) {
 		/* that's acceptable only for automounts done in private ns */
 		if (!(mnt_flags & MNT_SHRINKABLE))
 			goto unlock;
@@ -2493,42 +2534,73 @@ static int do_add_mount(struct mount *newmnt, struct path *path, int mnt_flags)
 }
 
 static bool mount_too_revealing(struct vfsmount *mnt, int *new_mnt_flags);
+static int do_new_mount_mc(struct mount_context *mc, struct path *mountpoint,
+			   unsigned int mnt_flags);
 
 /*
  * create a new mount for userspace and request it to be added into the
  * namespace's tree
  */
-static int do_new_mount(struct path *path, const char *fstype, int flags,
+static int do_new_mount(struct path *mountpoint, const char *fstype, int flags,
 			int mnt_flags, const char *name, void *data)
 {
-	struct file_system_type *type;
+	struct mount_context *mc;
 	struct vfsmount *mnt;
 	int err;
 
 	if (!fstype)
 		return -EINVAL;
 
-	type = get_fs_type(fstype);
-	if (!type)
-		return -ENODEV;
+	mc = vfs_fsopen(fstype);
+	if (IS_ERR(mc))
+		return PTR_ERR(mc);
+	mc->ms_flags = flags;
+	mc->mnt_flags = mnt_flags;
 
-	mnt = vfs_kern_mount(type, flags, name, data);
-	if (!IS_ERR(mnt) && (type->fs_flags & FS_HAS_SUBTYPE) &&
-	    !mnt->mnt_sb->s_subtype)
-		mnt = fs_set_subtype(mnt, fstype);
+	err = -ENOMEM;
+	mc->device = kstrdup(name, GFP_KERNEL);
+	if (!mc->device)
+		goto err_mc;
 
-	put_filesystem(type);
-	if (IS_ERR(mnt))
-		return PTR_ERR(mnt);
+	if (mc->ops) {
+		err = parse_monolithic_mount_data(mc, data);
+		if (err < 0)
+			goto err_mc;
 
-	if (mount_too_revealing(mnt, &mnt_flags)) {
-		mntput(mnt);
-		return -EPERM;
+		err = do_new_mount_mc(mc, mountpoint, mnt_flags);
+		if (err)
+			goto err_mc;
+
+	} else {
+		mnt = vfs_kern_mount(mc->fs_type, flags, name, data);
+		if (!IS_ERR(mnt) && (mc->fs_type->fs_flags & FS_HAS_SUBTYPE) &&
+		    !mnt->mnt_sb->s_subtype)
+			mnt = fs_set_subtype(mnt, fstype);
+
+		if (IS_ERR(mnt)) {
+			err = PTR_ERR(mnt);
+			goto err_mc;
+		}
+
+		err = -EPERM;
+		if (mount_too_revealing(mnt, &mnt_flags))
+			goto err_mnt;
+
+		err = do_add_mount(real_mount(mnt), mountpoint, mnt_flags,
+				   mc->mnt_ns);
+		if (err)
+			goto err_mnt;
 	}
 
-	err = do_add_mount(real_mount(mnt), path, mnt_flags);
-	if (err)
-		mntput(mnt);
+	put_mount_context(mc);
+	return 0;
+
+err_mnt:
+	mntput(mnt);
+err_mc:
+	if (mc->error)
+		pr_info("Mount failed: %s\n", mc->error);
+	put_mount_context(mc);
 	return err;
 }
 
@@ -2547,7 +2619,8 @@ int finish_automount(struct vfsmount *m, struct path *path)
 		goto fail;
 	}
 
-	err = do_add_mount(mnt, path, path->mnt->mnt_flags | MNT_SHRINKABLE);
+	err = do_add_mount(mnt, path, path->mnt->mnt_flags | MNT_SHRINKABLE,
+			   current->nsproxy->mnt_ns);
 	if (!err)
 		return 0;
 fail:
@@ -3061,6 +3134,130 @@ SYSCALL_DEFINE5(mount, char __user *, dev_name, char __user *, dir_name,
 	return ret;
 }
 
+static struct dentry *__do_mount_mc(struct mount_context *mc)
+{
+	struct super_block *sb;
+	struct dentry *root;
+	int ret;
+
+	root = mc->ops->mount(mc);
+	if (IS_ERR(root))
+		return root;
+
+	sb = root->d_sb;
+	BUG_ON(!sb);
+	WARN_ON(!sb->s_bdi);
+	sb->s_flags |= MS_BORN;
+
+	ret = security_mount_ctx_kern_mount(mc, sb);
+	if (ret < 0)
+		goto err_sb;
+
+	/*
+	 * filesystems should never set s_maxbytes larger than MAX_LFS_FILESIZE
+	 * but s_maxbytes was an unsigned long long for many releases. Throw
+	 * this warning for a little while to try and catch filesystems that
+	 * violate this rule.
+	 */
+	WARN((sb->s_maxbytes < 0), "%s set sb->s_maxbytes to "
+		"negative value (%lld)\n", mc->fs_type->name, sb->s_maxbytes);
+
+	up_write(&sb->s_umount);
+	return root;
+
+err_sb:
+	dput(root);
+	deactivate_locked_super(sb);
+	return ERR_PTR(ret);
+}
+
+struct vfsmount *vfs_kern_mount_mc(struct mount_context *mc)
+{
+	struct dentry *root;
+	struct mount *mnt;
+	int ret;
+
+	if (mc->ops->validate) {
+		ret = mc->ops->validate(mc);
+		if (ret < 0)
+			return ERR_PTR(ret);
+	}
+
+	mnt = alloc_vfsmnt(mc->device ?: "none");
+	if (!mnt)
+		return ERR_PTR(-ENOMEM);
+
+	if (mc->ms_flags & MS_KERNMOUNT)
+		mnt->mnt.mnt_flags = MNT_INTERNAL;
+
+	root = __do_mount_mc(mc);
+	if (IS_ERR(root)) {
+		mnt_free_id(mnt);
+		free_vfsmnt(mnt);
+		return ERR_CAST(root);
+	}
+
+	mnt->mnt.mnt_root	= root;
+	mnt->mnt.mnt_sb		= root->d_sb;
+	mnt->mnt_mountpoint	= mnt->mnt.mnt_root;
+	mnt->mnt_parent		= mnt;
+	lock_mount_hash();
+	list_add_tail(&mnt->mnt_instance, &root->d_sb->s_mounts);
+	unlock_mount_hash();
+	return &mnt->mnt;
+}
+EXPORT_SYMBOL_GPL(vfs_kern_mount_mc);
+
+struct vfsmount *
+vfs_submount_mc(const struct dentry *mountpoint, struct mount_context *mc)
+{
+	/* Until it is worked out how to pass the user namespace
+	 * through from the parent mount to the submount don't support
+	 * unprivileged mounts with submounts.
+	 */
+	if (mountpoint->d_sb->s_user_ns != &init_user_ns)
+		return ERR_PTR(-EPERM);
+
+	mc->ms_flags = MS_SUBMOUNT;
+	return vfs_kern_mount_mc(mc);
+}
+EXPORT_SYMBOL_GPL(vfs_submount_mc);
+
+static int do_new_mount_mc(struct mount_context *mc, struct path *mountpoint,
+			   unsigned int mnt_flags)
+{
+	struct vfsmount *mnt;
+	int ret;
+
+	mnt = vfs_kern_mount_mc(mc);
+	if (IS_ERR(mnt))
+		return PTR_ERR(mnt);
+
+	if ((mc->fs_type->fs_flags & FS_HAS_SUBTYPE) &&
+	    !mnt->mnt_sb->s_subtype) {
+		mnt = fs_set_subtype(mnt, mc->fs_type->name);
+		if (IS_ERR(mnt))
+			return PTR_ERR(mnt);
+	}
+
+	ret = -EPERM;
+	if (mount_too_revealing(mnt, &mnt_flags)) {
+		mc->error = "VFS: Mount too revealing";
+		goto err_mnt;
+	}
+
+	ret = do_add_mount(real_mount(mnt), mountpoint, mnt_flags, mc->mnt_ns);
+	if (ret < 0) {
+		mc->error = "VFS: Failed to add mount";
+		goto err_mnt;
+	}
+	return ret;
+
+err_mnt:
+	mntput(mnt);
+	return ret;
+}
+
 /*
  * Return true if path is reachable from root
  *
@@ -3302,6 +3499,23 @@ struct vfsmount *kern_mount_data(struct file_system_type *type, void *data)
 }
 EXPORT_SYMBOL_GPL(kern_mount_data);
 
+struct vfsmount *kern_mount_data_mc(struct mount_context *mc)
+{
+	struct vfsmount *mnt;
+
+	mc->ms_flags = MS_KERNMOUNT;
+	mnt = vfs_kern_mount_mc(mc);
+	if (!IS_ERR(mnt)) {
+		/*
+		 * it is a longterm mount, don't release mnt until
+		 * we unmount before file sys is unregistered
+		 */
+		real_mount(mnt)->mnt_ns = MNT_NS_INTERNAL;
+	}
+	return mnt;
+}
+EXPORT_SYMBOL_GPL(kern_mount_data_mc);
+
 void kern_unmount(struct vfsmount *mnt)
 {
 	/* release long term mount so mount point can be released */
diff --git a/fs/super.c b/fs/super.c
index adb0c0de428c..6e7b86520337 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -805,10 +805,13 @@ struct super_block *user_get_super(dev_t dev)
  *	@flags:	numeric part of options
  *	@data:	the rest of options
  *      @force: whether or not to force the change
+ *	@mc:	the mount context for filesystems that support it
+ *		(NULL if called from emergency or umount)
  *
  *	Alters the mount options of a mounted file system.
  */
-int do_remount_sb(struct super_block *sb, int flags, void *data, int force)
+int do_remount_sb(struct super_block *sb, int flags, void *data, int force,
+		  struct mount_context *mc)
 {
 	int retval;
 	int remount_ro;
@@ -850,8 +853,14 @@ int do_remount_sb(struct super_block *sb, int flags, void *data, int force)
 		}
 	}
 
-	if (sb->s_op->remount_fs) {
-		retval = sb->s_op->remount_fs(sb, &flags, data);
+	if (sb->s_op->remount_fs_mc ||
+	    sb->s_op->remount_fs) {
+		if (sb->s_op->remount_fs_mc) {
+			retval = sb->s_op->remount_fs_mc(sb, mc);
+			flags = mc->ms_flags;
+		} else {
+			retval = sb->s_op->remount_fs(sb, &flags, data);
+		}
 		if (retval) {
 			if (!force)
 				goto cancel_readonly;
@@ -898,7 +907,7 @@ static void do_emergency_remount(struct work_struct *work)
 			/*
 			 * What lock protects sb->s_flags??
 			 */
-			do_remount_sb(sb, MS_RDONLY, NULL, 1);
+			do_remount_sb(sb, MS_RDONLY, NULL, 1, NULL);
 		}
 		up_write(&sb->s_umount);
 		spin_lock(&sb_lock);
@@ -1048,6 +1057,37 @@ struct dentry *mount_ns(struct file_system_type *fs_type,
 
 EXPORT_SYMBOL(mount_ns);
 
+struct dentry *mount_ns_mc(struct mount_context *mc, void *ns)
+{
+	struct super_block *sb;
+
+	/* Don't allow mounting unless the caller has CAP_SYS_ADMIN
+	 * over the namespace.
+	 */
+	if (!(mc->ms_flags & MS_KERNMOUNT) &&
+	    !ns_capable(mc->user_ns, CAP_SYS_ADMIN))
+		return ERR_PTR(-EPERM);
+
+	sb = sget_userns(mc->fs_type, ns_test_super, ns_set_super,
+			 mc->ms_flags, mc->user_ns, ns);
+	if (IS_ERR(sb))
+		return ERR_CAST(sb);
+
+	if (!sb->s_root) {
+		int err;
+		err = mc->ops->fill_super(sb, mc);
+		if (err) {
+			deactivate_locked_super(sb);
+			return ERR_PTR(err);
+		}
+
+		sb->s_flags |= MS_ACTIVE;
+	}
+
+	return dget(sb->s_root);
+}
+EXPORT_SYMBOL(mount_ns_mc);
+
 #ifdef CONFIG_BLOCK
 static int set_bdev_super(struct super_block *s, void *data)
 {
@@ -1196,7 +1236,7 @@ struct dentry *mount_single(struct file_system_type *fs_type,
 		}
 		s->s_flags |= MS_ACTIVE;
 	} else {
-		do_remount_sb(s, flags, data, 0);
+		do_remount_sb(s, flags, data, 0, NULL);
 	}
 	return dget(s->s_root);
 }
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 30e5c14bd743..40fe5c5054ec 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -55,6 +55,7 @@ struct workqueue_struct;
 struct iov_iter;
 struct fscrypt_info;
 struct fscrypt_operations;
+struct mount_context;
 
 extern void __init inode_init(void);
 extern void __init inode_init_early(void);
@@ -701,6 +702,11 @@ static inline void inode_unlock(struct inode *inode)
 	up_write(&inode->i_rwsem);
 }
 
+static inline int inode_lock_killable(struct inode *inode)
+{
+	return down_write_killable(&inode->i_rwsem);
+}
+
 static inline void inode_lock_shared(struct inode *inode)
 {
 	down_read(&inode->i_rwsem);
@@ -1786,6 +1792,7 @@ struct super_operations {
 	int (*unfreeze_fs) (struct super_block *);
 	int (*statfs) (struct dentry *, struct kstatfs *);
 	int (*remount_fs) (struct super_block *, int *, char *);
+	int (*remount_fs_mc) (struct super_block *, struct mount_context *);
 	void (*umount_begin) (struct super_block *);
 
 	int (*show_options)(struct seq_file *, struct dentry *);
@@ -2020,8 +2027,10 @@ struct file_system_type {
 #define FS_HAS_SUBTYPE		4
 #define FS_USERNS_MOUNT		8	/* Can be mounted by userns root */
 #define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move() during rename() internally. */
+	unsigned short mc_size;		/* Size of mount context to allocate */
 	struct dentry *(*mount) (struct file_system_type *, int,
 		       const char *, void *);
+	int (*fsopen)(struct mount_context *, struct super_block *);
 	void (*kill_sb) (struct super_block *);
 	struct module *owner;
 	struct file_system_type * next;
@@ -2039,6 +2048,7 @@ struct file_system_type {
 
 #define MODULE_ALIAS_FS(NAME) MODULE_ALIAS("fs-" NAME)
 
+extern struct dentry *mount_ns_mc(struct mount_context *mc, void *ns);
 extern struct dentry *mount_ns(struct file_system_type *fs_type,
 	int flags, void *data, void *ns, struct user_namespace *user_ns,
 	int (*fill_super)(struct super_block *, void *, int));
@@ -2105,6 +2115,7 @@ extern int register_filesystem(struct file_system_type *);
 extern int unregister_filesystem(struct file_system_type *);
 extern struct vfsmount *kern_mount_data(struct file_system_type *, void *data);
 #define kern_mount(type) kern_mount_data(type, NULL)
+extern struct vfsmount *kern_mount_data_mc(struct mount_context *);
 extern void kern_unmount(struct vfsmount *mnt);
 extern int may_umount_tree(struct vfsmount *);
 extern int may_umount(struct vfsmount *);
diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index e29d4c62a3c8..f6aa68b8e68e 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -75,6 +75,32 @@
  *	should enable secure mode.
  *	@bprm contains the linux_binprm structure.
  *
+ * Security hooks for mount using fd context.
+ *
+ * @mount_ctx_alloc:
+ *	Allocate and attach a security structure to mc->security.  This pointer
+ *	is initialised to NULL by the caller.
+ *	@mc indicates the new mount context.
+ *	@src_sb indicates the source superblock of a submount.
+ * @mount_ctx_dup:
+ *	Allocate and attach a security structure to mc->security.  This pointer
+ *	is initialised to NULL by the caller.
+ *	@mc indicates the new mount context.
+ *	@src_mc indicates the original mount context.
+ * @mount_ctx_free:
+ *	Clean up a mount context.
+ *	@mc indicates the mount context.
+ * @mount_ctx_option:
+ *	Userspace provided an option to configure a mount.  The LSM may reject
+ *	it with an error and may use it for itself, in which case it should
+ *	return 1; otherwise it should return 0 to pass it on to the filesystem.
+ *	@mc indicates the mount context.
+ *	@opt indicates the option in "key[=val]" form.
+ * @mount_ctx_kern_mount:
+ *	Equivalent of sb_kern_mount, but with a mount_context.
+ *	@mc indicates the mount context.
+ *	@sb indicates the new superblock.
+ *
  * Security hooks for filesystem operations.
  *
  * @sb_alloc_security:
@@ -1358,6 +1384,12 @@ union security_list_options {
 	void (*bprm_committing_creds)(struct linux_binprm *bprm);
 	void (*bprm_committed_creds)(struct linux_binprm *bprm);
 
+	int (*mount_ctx_alloc)(struct mount_context *mc, struct super_block *src_sb);
+	int (*mount_ctx_dup)(struct mount_context *mc, struct mount_context *src_mc);
+	void (*mount_ctx_free)(struct mount_context *mc);
+	int (*mount_ctx_option)(struct mount_context *mc, char *opt);
+	int (*mount_ctx_kern_mount)(struct mount_context *mc, struct super_block *sb);
+
 	int (*sb_alloc_security)(struct super_block *sb);
 	void (*sb_free_security)(struct super_block *sb);
 	int (*sb_copy_data)(char *orig, char *copy);
@@ -1666,6 +1698,11 @@ struct security_hook_heads {
 	struct list_head bprm_secureexec;
 	struct list_head bprm_committing_creds;
 	struct list_head bprm_committed_creds;
+	struct list_head mount_ctx_alloc;
+	struct list_head mount_ctx_dup;
+	struct list_head mount_ctx_free;
+	struct list_head mount_ctx_option;
+	struct list_head mount_ctx_kern_mount;
 	struct list_head sb_alloc_security;
 	struct list_head sb_free_security;
 	struct list_head sb_copy_data;
diff --git a/include/linux/mount.h b/include/linux/mount.h
index 8e0352af06b7..cf2583406986 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -69,6 +69,56 @@ struct vfsmount {
 	int mnt_flags;
 };
 
+struct mount_context;
+struct mount_context_operations {
+	void (*free)(struct mount_context *mc);
+	int (*dup)(struct mount_context *mc, struct mount_context *src);
+	/* An option has been specified. */
+	int (*option)(struct mount_context *mc, char *p);
+	/* Parse monolithic mount data. */
+	int (*monolithic_mount_data)(struct mount_context *mc, void *data);
+	/* Validate the mount options */
+	int (*validate)(struct mount_context *mc);
+	/* Perform the mount. */
+	struct dentry *(*mount)(struct mount_context *mc);
+	/* Fill in a superblock */
+	int (*fill_super)(struct super_block *s, struct mount_context *mc);
+};
+
+enum mount_type {
+	MOUNT_TYPE_NEW,		/* New mount made directly */
+	MOUNT_TYPE_SUBMOUNT,	/* New mount made automatically */
+	MOUNT_TYPE_REMOUNT,	/* Change of an existing mount */
+};
+
+/*
+ * Mount context as allocated and constructed by fsopen().  The filesystem must
+ * supply the operations in struct mount_context_operations.  The size of the
+ * object allocated is in struct file_system_type::mc_size; this struct must be
+ * embedded as the first thing in the filesystem's own context.
+ */
+struct mount_context {
+	const struct mount_context_operations *ops;
+	struct file_system_type	*fs_type;
+	struct user_namespace	*user_ns;	/* The user namespace for this mount */
+	struct mnt_namespace	*mnt_ns;	/* The mount namespace for this mount */
+	struct pid_namespace	*pid_ns;	/* The process ID namespace for this mount */
+	struct net		*net_ns;	/* The network namespace for this mount */
+	const struct cred	*cred;		/* The mounter's credentials */
+	char			*device;	/* The device name or mount target */
+	char			*root_path;	/* The path within the mount to mount */
+	void			*security;	/* The LSM context */
+	const char		*error;		/* Error string to be read by read() */
+	unsigned int		ms_flags;	/* The superblock flags (MS_*) */
+	unsigned int		mnt_flags;	/* The mount flags (MNT_*) */
+	bool			mounted;	/* Set when mounted */
+	bool			sloppy;		/* Unrecognised options are okay */
+	bool			silent;
+	enum mount_type		mount_type : 8;
+};
+
+extern const struct file_operations fs_fs_fops;
+
 struct file; /* forward dec */
 struct path;
 
@@ -90,9 +140,26 @@ struct file_system_type;
 extern struct vfsmount *vfs_kern_mount(struct file_system_type *type,
 				      int flags, const char *name,
 				      void *data);
+extern struct vfsmount *vfs_kern_mount_mc(struct mount_context *mc);
 extern struct vfsmount *vfs_submount(const struct dentry *mountpoint,
 				     struct file_system_type *type,
 				     const char *name, void *data);
+extern struct vfsmount *vfs_submount_mc(const struct dentry *mountpoint,
+					struct mount_context *mc);
+extern struct mount_context *vfs_fsopen(const char *fs_name);
+extern struct mount_context *__vfs_fsopen(struct file_system_type *fs_type,
+					  struct super_block *src_sb,
+					  unsigned int ms_flags,
+					  unsigned int mnt_flags,
+					  enum mount_type mount_type);
+extern struct mount_context *vfs_mntopen(struct vfsmount *mnt,
+					 unsigned int ms_flags,
+					 unsigned int mnt_flags,
+					 enum mount_type mount_type);
+extern struct mount_context *vfs_dup_mount_context(struct mount_context *src);
+extern int vfs_mount_option(struct mount_context *mc, char *data);
+extern int generic_monolithic_mount_data(struct mount_context *ctx, void *data);
+extern void put_mount_context(struct mount_context *ctx);
 
 extern void mnt_set_expiry(struct vfsmount *mnt, struct list_head *expiry_list);
 extern void mark_mounts_for_expiry(struct list_head *mounts);
diff --git a/include/linux/security.h b/include/linux/security.h
index 96899fad7016..91efe3039bff 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -55,6 +55,7 @@ struct msg_queue;
 struct xattr;
 struct xfrm_sec_ctx;
 struct mm_struct;
+struct mount_context;
 
 /* If capable should audit the security request */
 #define SECURITY_CAP_NOAUDIT 0
@@ -220,6 +221,11 @@ int security_bprm_check(struct linux_binprm *bprm);
 void security_bprm_committing_creds(struct linux_binprm *bprm);
 void security_bprm_committed_creds(struct linux_binprm *bprm);
 int security_bprm_secureexec(struct linux_binprm *bprm);
+int security_mount_ctx_alloc(struct mount_context *mc, struct super_block *sb);
+int security_mount_ctx_dup(struct mount_context *mc, struct mount_context *src);
+void security_mount_ctx_free(struct mount_context *mc);
+int security_mount_ctx_option(struct mount_context *mc, char *opt);
+int security_mount_ctx_kern_mount(struct mount_context *mc, struct super_block *sb);
 int security_sb_alloc(struct super_block *sb);
 void security_sb_free(struct super_block *sb);
 int security_sb_copy_data(char *orig, char *copy);
@@ -513,6 +519,29 @@ static inline int security_bprm_secureexec(struct linux_binprm *bprm)
 	return cap_bprm_secureexec(bprm);
 }
 
+static inline int security_mount_ctx_alloc(struct mount_context *mc,
+					   struct super_block *src_sb)
+{
+	return 0;
+}
+static inline int security_mount_ctx_dup(struct mount_context *mc,
+					 struct mount_context *src)
+{
+	return 0;
+}
+static inline void security_mount_ctx_free(struct mount_context *mc)
+{
+}
+static inline int security_mount_ctx_option(struct mount_context *mc, char *opt)
+{
+	return 0;
+}
+static inline int security_mount_ctx_kern_mount(struct mount_context *mc,
+						struct super_block *sb)
+{
+	return 0;
+}
+
 static inline int security_sb_alloc(struct super_block *sb)
 {
 	return 0;
diff --git a/security/security.c b/security/security.c
index 23555c5504f6..2e522361df66 100644
--- a/security/security.c
+++ b/security/security.c
@@ -309,6 +309,31 @@ int security_bprm_secureexec(struct linux_binprm *bprm)
 	return call_int_hook(bprm_secureexec, 0, bprm);
 }
 
+int security_mount_ctx_alloc(struct mount_context *mc, struct super_block *src_sb)
+{
+	return call_int_hook(mount_ctx_alloc, 0, mc, src_sb);
+}
+
+int security_mount_ctx_dup(struct mount_context *mc, struct mount_context *src_mc)
+{
+	return call_int_hook(mount_ctx_dup, 0, mc, src_mc);
+}
+
+void security_mount_ctx_free(struct mount_context *mc)
+{
+	call_void_hook(mount_ctx_free, mc);
+}
+
+int security_mount_ctx_option(struct mount_context *mc, char *opt)
+{
+	return call_int_hook(mount_ctx_option, 0, mc, opt);
+}
+
+int security_mount_ctx_kern_mount(struct mount_context *mc, struct super_block *sb)
+{
+	return call_int_hook(mount_ctx_kern_mount, 0, mc, sb);
+}
+
 int security_sb_alloc(struct super_block *sb)
 {
 	return call_int_hook(sb_alloc_security, 0, sb);
@@ -1659,6 +1684,13 @@ struct security_hook_heads security_hook_heads = {
 		LIST_HEAD_INIT(security_hook_heads.bprm_committing_creds),
 	.bprm_committed_creds =
 		LIST_HEAD_INIT(security_hook_heads.bprm_committed_creds),
+	.mount_ctx_alloc = LIST_HEAD_INIT(security_hook_heads.mount_ctx_alloc),
+	.mount_ctx_dup = LIST_HEAD_INIT(security_hook_heads.mount_ctx_dup),
+	.mount_ctx_free = LIST_HEAD_INIT(security_hook_heads.mount_ctx_free),
+	.mount_ctx_option =
+		LIST_HEAD_INIT(security_hook_heads.mount_ctx_option),
+	.mount_ctx_kern_mount =
+		LIST_HEAD_INIT(security_hook_heads.mount_ctx_kern_mount),
 	.sb_alloc_security =
 		LIST_HEAD_INIT(security_hook_heads.sb_alloc_security),
 	.sb_free_security =
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 0c2ac318aa7f..cf38db840f71 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -2826,6 +2826,179 @@ static int selinux_umount(struct vfsmount *mnt, int flags)
 				   FILESYSTEM__UNMOUNT, NULL);
 }
 
+/* fsopen mount context operations */
+
+static int selinux_mount_ctx_alloc(struct mount_context *mc,
+				   struct super_block *src_sb)
+{
+	struct security_mnt_opts *opts;
+
+	opts = kzalloc(sizeof(*opts), GFP_KERNEL);
+	if (!opts)
+		return -ENOMEM;
+
+	mc->security = opts;
+	return 0;
+}
+
+static int selinux_mount_ctx_dup(struct mount_context *mc,
+				 struct mount_context *src_mc)
+{
+	const struct security_mnt_opts *src = src_mc->security;
+	struct security_mnt_opts *opts;
+	int i, n;
+
+	opts = kzalloc(sizeof(*opts), GFP_KERNEL);
+	if (!opts)
+		return -ENOMEM;
+	mc->security = opts;
+
+	if (!src || !src->num_mnt_opts)
+		return 0;
+	n = opts->num_mnt_opts = src->num_mnt_opts;
+
+	if (src->mnt_opts) {
+		opts->mnt_opts = kcalloc(n, sizeof(char *), GFP_KERNEL);
+		if (!opts->mnt_opts)
+			return -ENOMEM;
+
+		for (i = 0; i < n; i++) {
+			if (src->mnt_opts[i]) {
+				opts->mnt_opts[i] = kstrdup(src->mnt_opts[i],
+							    GFP_KERNEL);
+				if (!opts->mnt_opts[i])
+					return -ENOMEM;
+			}
+		}
+	}
+
+	if (src->mnt_opts_flags) {
+		opts->mnt_opts_flags = kmemdup(src->mnt_opts_flags,
+					       n * sizeof(int), GFP_KERNEL);
+		if (!opts->mnt_opts_flags)
+			return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static void selinux_mount_ctx_free(struct mount_context *mc)
+{
+	struct security_mnt_opts *opts = mc->security;
+
+	security_free_mnt_opts(opts);
+	mc->security = NULL;
+}
+
+static int selinux_mount_ctx_option(struct mount_context *mc, char *opt)
+{
+	struct security_mnt_opts *opts = mc->security;
+	substring_t args[MAX_OPT_ARGS];
+	unsigned int have;
+	char *c, **oo;
+	void *old;
+	int token, ctx, i;
+
+	token = match_token(opt, tokens, args);
+	if (token == Opt_error)
+		return 0; /* Doesn't belong to us. */
+
+	have = 0;
+	for (i = 0; i < opts->num_mnt_opts; i++)
+		have |= 1 << opts->mnt_opts_flags[i];
+	if (have & (1 << token)) {
+		mc->error = "SELinux: Duplicate mount options";
+		return -EINVAL;
+	}
+
+	switch (token) {
+	case Opt_context:
+		if (have & (1 << Opt_defcontext))
+			goto incompatible;
+		ctx = CONTEXT_MNT;
+		goto copy_context_string;
+
+	case Opt_fscontext:
+		ctx = FSCONTEXT_MNT;
+		goto copy_context_string;
+
+	case Opt_rootcontext:
+		ctx = ROOTCONTEXT_MNT;
+		goto copy_context_string;
+
+	case Opt_defcontext:
+		if (have & (1 << Opt_context))
+			goto incompatible;
+		ctx = DEFCONTEXT_MNT;
+		goto copy_context_string;
+
+	case Opt_labelsupport:
+		return 1;
+
+	default:
+		mc->error = "SELinux: Unknown mount option";
+		return -EINVAL;
+	}
+
+copy_context_string:
+	if (opts->num_mnt_opts >= 3) {
+		mc->error = "SELinux: Too many options";
+		return -EINVAL;
+	}
+	if (!opts->mnt_opts_flags) {
+		opts->mnt_opts_flags = kcalloc(3, sizeof(int), GFP_KERNEL);
+		if (!opts->mnt_opts_flags)
+			return -ENOMEM;
+	}
+
+	if (opts->mnt_opts) {
+		oo = kmalloc((opts->num_mnt_opts + 1) * sizeof(char *),
+			     GFP_KERNEL);
+		if (!oo)
+			return -ENOMEM;
+		memcpy(oo, opts->mnt_opts, opts->num_mnt_opts * sizeof(char *));
+		oo[opts->num_mnt_opts] = NULL;
+		old = opts->mnt_opts;
+		opts->mnt_opts = oo;
+		kfree(old);
+	}
+
+	c = match_strdup(&args[0]);
+	if (!c)
+		return -ENOMEM;
+	opts->mnt_opts[opts->num_mnt_opts] = c;
+	opts->mnt_opts_flags[opts->num_mnt_opts] = ctx;
+	opts->num_mnt_opts++;
+	return 1;
+
+incompatible:
+	mc->error = "SELinux: Incompatible mount options";
+	return -EINVAL;
+}
+
+static int selinux_mount_ctx_kern_mount(struct mount_context *mc,
+					struct super_block *sb)
+{
+	const struct cred *cred = current_cred();
+	struct common_audit_data ad;
+	int rc;
+
+	rc = selinux_set_mnt_opts(sb, mc->security, 0, NULL);
+	if (rc)
+		return rc;
+
+	/* Allow all mounts performed by the kernel */
+	if (mc->ms_flags & MS_KERNMOUNT)
+		return 0;
+
+	ad.type = LSM_AUDIT_DATA_DENTRY;
+	ad.u.dentry = sb->s_root;
+	rc = superblock_has_perm(cred, sb, FILESYSTEM__MOUNT, &ad);
+	if (rc < 0)
+		mc->error = "SELinux: Mount of superblock not permitted";
+	return rc;
+}
+
 /* inode security operations */
 
 static int selinux_inode_alloc_security(struct inode *inode)
@@ -6131,6 +6304,12 @@ static struct security_hook_list selinux_hooks[] = {
 	LSM_HOOK_INIT(bprm_committed_creds, selinux_bprm_committed_creds),
 	LSM_HOOK_INIT(bprm_secureexec, selinux_bprm_secureexec),
 
+	LSM_HOOK_INIT(mount_ctx_alloc, selinux_mount_ctx_alloc),
+	LSM_HOOK_INIT(mount_ctx_dup, selinux_mount_ctx_dup),
+	LSM_HOOK_INIT(mount_ctx_free, selinux_mount_ctx_free),
+	LSM_HOOK_INIT(mount_ctx_option, selinux_mount_ctx_option),
+	LSM_HOOK_INIT(mount_ctx_kern_mount, selinux_mount_ctx_kern_mount),
+
 	LSM_HOOK_INIT(sb_alloc_security, selinux_sb_alloc_security),
 	LSM_HOOK_INIT(sb_free_security, selinux_sb_free_security),
 	LSM_HOOK_INIT(sb_copy_data, selinux_sb_copy_data),

* [PATCH 4/9] Implement fsopen() to prepare for a mount
  2017-05-03 16:04 [RFC][PATCH 0/9] VFS: Introduce mount context David Howells
                   ` (2 preceding siblings ...)
  2017-05-03 16:04 ` [PATCH 3/9] VFS: Introduce a mount context David Howells
@ 2017-05-03 16:05 ` David Howells
  2017-05-03 18:37   ` Jeff Layton
                     ` (8 more replies)
  2017-05-03 16:05 ` [PATCH 5/9] Implement fsmount() to effect a pre-configured mount David Howells
                   ` (10 subsequent siblings)
  14 siblings, 9 replies; 66+ messages in thread
From: David Howells @ 2017-05-03 16:05 UTC (permalink / raw)
  To: viro; +Cc: linux-fsdevel, dhowells, linux-nfs, linux-kernel, mszeredi

Provide an fsopen() system call that starts the process of preparing to
mount, using an fd as a context handle.  fsopen() is given the name of the
filesystem that will be used:

	int mfd = fsopen(const char *fsname, int reserved,
			 int open_flags);

where reserved should be -1 for the moment (it will be used to pass the
namespace information in future) and open_flags can be 0 or O_CLOEXEC.

For example:

	mfd = fsopen("ext4", -1, O_CLOEXEC);
	write(mfd, "d /dev/sdb1"); // note I'm ignoring write's length arg
	write(mfd, "o noatime");
	write(mfd, "o acl");
	write(mfd, "o user_attr");
	write(mfd, "o iversion");
	write(mfd, "o ");
	write(mfd, "r /my/container"); // root inside the fs
	fsmount(mfd, container_fd, "/mnt", AT_NO_FOLLOW);

	mfd = fsopen("afs", -1, 0);
	write(mfd, "d %grand.central.org:root.cell");
	write(mfd, "o cell=grand.central.org");
	write(mfd, "r /");
	fsmount(mfd, AT_FDCWD, "/mnt", 0);
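
The write() calls above omit the length argument for brevity.  As a minimal
userspace sketch, assuming the one-command-per-write format that fs_fs_write()
in this patch checks for (a type character of 'd', 'o' or 'r', one space, then
the payload, 3 to 4095 bytes in total), a helper could format and validate a
command before writing it.  The name fs_format_cmd() is hypothetical and is
not part of this series:

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of this series): format one command for
 * the fsopen() fd as "<type> <payload>" into buf.  Returns the number of
 * bytes to pass to write(), or -1 if the command would fail the type or
 * length checks or does not fit in buf.
 */
static long fs_format_cmd(char *buf, size_t bufsize, char type,
			  const char *payload)
{
	size_t plen = strlen(payload);
	size_t len = 2 + plen;

	if (type != 'd' && type != 'o' && type != 'r')
		return -1;
	/* The kernel side in this patch accepts 3..4095 bytes per write. */
	if (len < 3 || len > 4095 || len + 1 > bufsize)
		return -1;

	buf[0] = type;
	buf[1] = ' ';
	memcpy(buf + 2, payload, plen);
	buf[len] = '\0';	/* convenience only; not written to the fd */
	return (long)len;
}
```

A caller would then do something like:

	n = fs_format_cmd(buf, sizeof(buf), 'o', "noatime");
	if (n > 0)
		write(mfd, buf, n);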

If an error is reported at any step, an error message may be available to
read() back from the fd (ENODATA is reported if no message is available), in
the form:

	"e <subsys>:<problem>"
	"e SELinux:Mount on mountpoint not permitted"
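
As a rough userspace sketch of consuming such a message, the following splits
it into its subsystem and problem parts.  The helper name fs_parse_error() is
hypothetical and not part of this series.  The messages set in the code in
this series have a space after the colon (for instance "SELinux: Mount on
mountpoint not permitted"), so the sketch assumes that spaces after the colon
are optional and skips them:

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of this series): split a message read back
 * from the fsopen() fd, of the form "e <subsys>:<problem>", into its two
 * parts.  msg need not be NUL-terminated; len is the count returned by
 * read().  Returns 0 on success, or -1 if the message is malformed or does
 * not fit in the supplied buffers.
 */
static int fs_parse_error(const char *msg, size_t len,
			  char *subsys, size_t subsys_size,
			  char *problem, size_t problem_size)
{
	const char *body, *colon, *text;
	size_t blen, slen, tlen;

	if (len < 2 || msg[0] != 'e' || msg[1] != ' ')
		return -1;
	body = msg + 2;
	blen = len - 2;

	colon = memchr(body, ':', blen);
	if (!colon)
		return -1;
	slen = (size_t)(colon - body);

	/* Skip any spaces that follow the colon. */
	text = colon + 1;
	tlen = blen - slen - 1;
	while (tlen > 0 && *text == ' ') {
		text++;
		tlen--;
	}

	if (slen + 1 > subsys_size || tlen + 1 > problem_size)
		return -1;
	memcpy(subsys, body, slen);
	subsys[slen] = '\0';
	memcpy(problem, text, tlen);
	problem[tlen] = '\0';
	return 0;
}
```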

Once fsmount() has been called, further write() calls will incur EBUSY,
even if the fsmount() fails.  read() is still possible to retrieve error
information.
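
The one-shot behaviour described above can be modelled in a few lines.  This
is only a toy userspace model of the rule, not the kernel code; the names
model_ctx, model_write() and model_fsmount() are made up for illustration:

```c
#include <errno.h>
#include <stdbool.h>

/* Toy userspace model of the context state; not the kernel structure. */
struct model_ctx {
	bool mounted;	/* set by the first mount attempt, never cleared */
};

/* Model of write(): configuration is refused once a mount was attempted. */
static int model_write(struct model_ctx *ctx)
{
	return ctx->mounted ? -EBUSY : 0;
}

/*
 * Model of fsmount(): latch the context first, then try the mount.
 * mount_result stands in for whatever the real mount attempt returns.
 */
static int model_fsmount(struct model_ctx *ctx, int mount_result)
{
	if (ctx->mounted)
		return -EBUSY;
	ctx->mounted = true;
	return mount_result;
}
```

Because the flag is latched before the mount is attempted, a failed mount
still leaves the context closed to further configuration.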

The fsopen() syscall creates a mount context and hangs it off the fd that it
returns.

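Netlink is not used because it is optional: relying on it would make this
interface depend on CONFIG_NET and would introduce network namespacing issues.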

Signed-off-by: David Howells <dhowells@redhat.com>
---

 arch/x86/entry/syscalls/syscall_32.tbl |    1 
 arch/x86/entry/syscalls/syscall_64.tbl |    1 
 fs/Makefile                            |    2 
 fs/fsopen.c                            |  295 ++++++++++++++++++++++++++++++++
 include/linux/syscalls.h               |    1 
 include/uapi/linux/magic.h             |    1 
 kernel/sys_ni.c                        |    3 
 7 files changed, 303 insertions(+), 1 deletion(-)
 create mode 100644 fs/fsopen.c

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 448ac2161112..9bf8d4c62f85 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -391,3 +391,4 @@
 382	i386	pkey_free		sys_pkey_free
 383	i386	statx			sys_statx
 384	i386	arch_prctl		sys_arch_prctl			compat_sys_arch_prctl
+385	i386	fsopen			sys_fsopen
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 5aef183e2f85..9b198c5fc412 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -339,6 +339,7 @@
 330	common	pkey_alloc		sys_pkey_alloc
 331	common	pkey_free		sys_pkey_free
 332	common	statx			sys_statx
+333	common	fsopen			sys_fsopen
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/Makefile b/fs/Makefile
index 308a104a9a07..b79024dbb37c 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -12,7 +12,7 @@ obj-y :=	open.o read_write.o file_table.o super.o \
 		seq_file.o xattr.o libfs.o fs-writeback.o \
 		pnode.o splice.o sync.o utimes.o \
 		stack.o fs_struct.o statfs.o fs_pin.o nsfs.o \
-		mount_context.o
+		mount_context.o fsopen.o
 
 ifeq ($(CONFIG_BLOCK),y)
 obj-y +=	buffer.o block_dev.o direct-io.o mpage.o
diff --git a/fs/fsopen.c b/fs/fsopen.c
new file mode 100644
index 000000000000..f02ea7d265db
--- /dev/null
+++ b/fs/fsopen.c
@@ -0,0 +1,295 @@
+/* fsopen.c: Filesystem configuration through an fd prior to mounting
+ *
+ * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#include <linux/mount.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/file.h>
+#include <linux/magic.h>
+#include <linux/syscalls.h>
+
+static struct vfsmount *fs_fs_mnt __read_mostly;
+static struct qstr empty_name = { .name = "" };
+
+static int fs_fs_release(struct inode *inode, struct file *file)
+{
+	struct mount_context *mc = file->private_data;
+
+	file->private_data = NULL;
+
+	put_mount_context(mc);
+	return 0;
+}
+
+/*
+ * Read any error message back from the fd.  Will be prefixed by "e ".
+ */
+static ssize_t fs_fs_read(struct file *file, char __user *_buf, size_t len, loff_t *pos)
+{
+	struct mount_context *mc = file->private_data;
+	const char *msg;
+	size_t mlen;
+
+	msg = mc->error;
+	if (!msg)
+		return -ENODATA;
+
+	mlen = strlen(msg);
+	if (mlen + 2 > len)
+		return -ETOOSMALL;
+	if (copy_to_user(_buf, "e ", 2) != 0 ||
+	    copy_to_user(_buf + 2, msg, mlen) != 0)
+		return -EFAULT;
+	return mlen + 2;
+}
+
+/*
+ * Userspace writes configuration data to the fd and we parse it here.  For the
+ * moment, we assume a single option per write.  Each line written is of the form
+ *
+ *	<option_type><space><stuff...>
+ *
+ *	d /dev/sda1				-- Device name
+ *	o noatime				-- Option without value
+ *	o cell=grand.central.org		-- Option with value
+ *	r /					-- Dir within device to mount
+ */
+static ssize_t fs_fs_write(struct file *file,
+			   const char __user *_buf, size_t len, loff_t *pos)
+{
+	struct mount_context *mc = file->private_data;
+	struct inode *inode = file_inode(file);
+	char opt[2], *data;
+	ssize_t ret;
+
+	if (len < 3 || len > 4095)
+		return -EINVAL;
+
+	if (copy_from_user(opt, _buf, 2) != 0)
+		return -EFAULT;
+	switch (opt[0]) {
+	case 'd':
+	case 'o':
+	case 'r':
+		break;
+	default:
+		return -EINVAL;
+	}
+	if (opt[1] != ' ')
+		return -EINVAL;
+
+	data = kmalloc(len - 2 + 1, GFP_KERNEL);
+	if (!data)
+		return -ENOMEM;
+
+	ret = -EFAULT;
+	if (copy_from_user(data, _buf + 2, len - 2) != 0)
+		goto err_free;
+	data[len - 2] = 0;
+
+	/* From this point onwards we need to lock the fd against someone
+	 * trying to mount it.
+	 */
+	ret = inode_lock_killable(inode);
+	if (ret < 0)
+		goto err_free;
+
+	ret = -EBUSY;
+	if (mc->mounted)
+		goto err_unlock;
+
+	ret = -EINVAL;
+	switch (opt[0]) {
+	case 'd':
+		if (mc->device)
+			goto err_unlock;
+		mc->device = data;
+		data = NULL;
+		break;
+
+	case 'o':
+		ret = vfs_mount_option(mc, data);
+		if (ret < 0)
+			goto err_unlock;
+		break;
+
+	case 'r':
+		if (mc->root_path)
+			goto err_unlock;
+		mc->root_path = data;
+		data = NULL;
+		break;
+
+	default:
+		goto err_unlock;
+	}
+
+	ret = len;
+err_unlock:
+	inode_unlock(inode);
+err_free:
+	kfree(data);
+	return ret;
+}
+
+const struct file_operations fs_fs_fops = {
+	.read		= fs_fs_read,
+	.write		= fs_fs_write,
+	.release	= fs_fs_release,
+	.llseek		= no_llseek,
+};
+
+/*
+ * Indicate the name we want to display the filesystem file as.
+ */
+static char *fs_fs_dname(struct dentry *dentry, char *buffer, int buflen)
+{
+	return dynamic_dname(dentry, buffer, buflen, "fs:[%lu]",
+			     d_inode(dentry)->i_ino);
+}
+
+static const struct dentry_operations fs_fs_dentry_operations = {
+	.d_dname	= fs_fs_dname,
+};
+
+/*
+ * Create a file that can be used to configure a new mount.
+ */
+static struct file *create_fs_file(struct mount_context *mc)
+{
+	struct inode *inode;
+	struct file *f;
+	struct path path;
+	int ret;
+
+	inode = alloc_anon_inode(fs_fs_mnt->mnt_sb);
+	if (IS_ERR(inode))
+		return ERR_CAST(inode);
+	inode->i_fop = &fs_fs_fops;
+
+	ret = -ENOMEM;
+	path.dentry = d_alloc_pseudo(fs_fs_mnt->mnt_sb, &empty_name);
+	if (!path.dentry)
+		goto err_inode;
+	path.mnt = mntget(fs_fs_mnt);
+
+	d_instantiate(path.dentry, inode);
+
+	f = alloc_file(&path, FMODE_READ | FMODE_WRITE, &fs_fs_fops);
+	if (IS_ERR(f)) {
+		ret = PTR_ERR(f);
+		goto err_file;
+	}
+
+	f->private_data = mc;
+	return f;
+
+err_file:
+	path_put(&path);
+	return ERR_PTR(ret);
+
+err_inode:
+	iput(inode);
+	return ERR_PTR(ret);
+}
+
+static const struct super_operations fs_fs_ops = {
+	.drop_inode	= generic_delete_inode,
+	.destroy_inode	= free_inode_nonrcu,
+	.statfs		= simple_statfs,
+};
+
+static struct dentry *fs_fs_mount(struct file_system_type *fs_type,
+				  int flags, const char *dev_name,
+				  void *data)
+{
+	return mount_pseudo(fs_type, "fs_fs:", &fs_fs_ops,
+			    &fs_fs_dentry_operations, FS_FS_MAGIC);
+}
+
+static struct file_system_type fs_fs_type = {
+	.name		= "fs_fs",
+	.mount		= fs_fs_mount,
+	.kill_sb	= kill_anon_super,
+};
+
+static int __init init_fs_fs(void)
+{
+	int ret;
+
+	ret = register_filesystem(&fs_fs_type);
+	if (ret < 0)
+		panic("Cannot register fs_fs\n");
+
+	fs_fs_mnt = kern_mount(&fs_fs_type);
+	if (IS_ERR(fs_fs_mnt))
+		panic("Cannot mount fs_fs: %ld\n", PTR_ERR(fs_fs_mnt));
+	return 0;
+}
+
+fs_initcall(init_fs_fs);
+
+/*
+ * Open a filesystem by name so that it can be configured for mounting.
+ *
+ * We are allowed to specify a container in which the filesystem will be
+ * opened, thereby indicating which namespaces will be used (notably, which
+ * network namespace will be used for network filesystems).
+ */
+SYSCALL_DEFINE3(fsopen, const char __user *, _fs_name, int, reserved,
+		unsigned int, flags)
+{
+	struct mount_context *mc;
+	struct file *file;
+	const char *fs_name;
+	int fd, ret;
+
+	if (flags & ~O_CLOEXEC || reserved != -1)
+		return -EINVAL;
+
+	fs_name = strndup_user(_fs_name, PAGE_SIZE);
+	if (IS_ERR(fs_name))
+		return PTR_ERR(fs_name);
+
+	mc = vfs_fsopen(fs_name);
+	if (IS_ERR(mc)) {
+		ret = PTR_ERR(mc);
+		goto err_fs_name;
+	}
+
+	ret = -ENOTSUPP;
+	if (!mc->ops)
+		goto err_mc;
+
+	file = create_fs_file(mc);
+	if (IS_ERR(file)) {
+		ret = PTR_ERR(file);
+		goto err_mc;
+	}
+
+	ret = get_unused_fd_flags(flags & O_CLOEXEC);
+	if (ret < 0)
+		goto err_file;
+
+	fd = ret;
+	fd_install(fd, file);
+	return fd;
+
+err_file:
+	fput(file);
+	return ret;
+
+err_mc:
+	put_mount_context(mc);
+err_fs_name:
+	kfree(fs_name);
+	return ret;
+}
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 980c3c9b06f8..91ec8802ad5d 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -905,5 +905,6 @@ asmlinkage long sys_pkey_alloc(unsigned long flags, unsigned long init_val);
 asmlinkage long sys_pkey_free(int pkey);
 asmlinkage long sys_statx(int dfd, const char __user *path, unsigned flags,
 			  unsigned mask, struct statx __user *buffer);
+asmlinkage long sys_fsopen(const char *fs_name, int containerfd, unsigned int flags);
 
 #endif
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index e230af2e6855..88ae83492f7c 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -84,5 +84,6 @@
 #define UDF_SUPER_MAGIC		0x15013346
 #define BALLOON_KVM_MAGIC	0x13661366
 #define ZSMALLOC_MAGIC		0x58295829
+#define FS_FS_MAGIC		0x66736673
 
 #endif /* __LINUX_MAGIC_H__ */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 8acef8576ce9..de1dc63e7e47 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -258,3 +258,6 @@ cond_syscall(sys_membarrier);
 cond_syscall(sys_pkey_mprotect);
 cond_syscall(sys_pkey_alloc);
 cond_syscall(sys_pkey_free);
+
+/* fd-based mount */
+cond_syscall(sys_fsopen);

* [PATCH 5/9] Implement fsmount() to effect a pre-configured mount
  2017-05-03 16:04 [RFC][PATCH 0/9] VFS: Introduce mount context David Howells
                   ` (3 preceding siblings ...)
  2017-05-03 16:05 ` [PATCH 4/9] Implement fsopen() to prepare for a mount David Howells
@ 2017-05-03 16:05 ` David Howells
  2017-05-03 16:05 ` [PATCH 6/9] Sample program for driving fsopen/fsmount David Howells
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 66+ messages in thread
From: David Howells @ 2017-05-03 16:05 UTC (permalink / raw)
  To: viro; +Cc: linux-fsdevel, dhowells, linux-nfs, linux-kernel, mszeredi

Provide a system call by which a filesystem opened with fsopen() and
configured by a series of writes can be mounted:

	int ret = fsmount(int fsfd, int dfd, const char *path);

where fsfd is the fd returned by fsopen(), and dfd and path describe the
mountpoint.  dfd can be AT_FDCWD or an fd open to a directory.

In the event that fsmount() fails, it may be possible to get an error
message by calling read().  If no message is available, ENODATA will be
reported.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 arch/x86/entry/syscalls/syscall_32.tbl |    1 
 arch/x86/entry/syscalls/syscall_64.tbl |    1 
 fs/namespace.c                         |   93 ++++++++++++++++++++++++++++++++
 include/linux/lsm_hooks.h              |    6 ++
 include/linux/security.h               |    6 ++
 include/linux/syscalls.h               |    1 
 kernel/sys_ni.c                        |    1 
 security/security.c                    |    7 ++
 security/selinux/hooks.c               |   13 ++++
 9 files changed, 129 insertions(+)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 9bf8d4c62f85..abe6ea95e0e6 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -392,3 +392,4 @@
 383	i386	statx			sys_statx
 384	i386	arch_prctl		sys_arch_prctl			compat_sys_arch_prctl
 385	i386	fsopen			sys_fsopen
+386	i386	fsmount			sys_fsmount
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 9b198c5fc412..0977c5079831 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -340,6 +340,7 @@
 331	common	pkey_free		sys_pkey_free
 332	common	statx			sys_statx
 333	common	fsopen			sys_fsopen
+334	common	fsmount			sys_fsmount
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/namespace.c b/fs/namespace.c
index e0edab9af308..a367b6cb2ac8 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3259,6 +3259,99 @@ static int do_new_mount_mc(struct mount_context *mc, struct path *mountpoint,
 }
 
 /*
+ * Mount a new, prepared superblock (specified by fs_fd) on the location
+ * specified by dfd and dir_name.  dfd can be AT_FDCWD, a dir fd or a container
+ * fd.  This cannot be used for binding, moving or remounting mounts.
+ *
+ * If dfd is a container and dir_name is NULL, then we try to make this the root
+ * filesystem of that container.  This requires CONTAINER_NEW_EMPTY_FS_NS to
+ * have been passed when creating the container.  This operation may only be
+ * done once.
+ */
+SYSCALL_DEFINE3(fsmount, int, fs_fd, int, dfd, const char __user *, dir_name)
+{
+	struct mount_context *mc;
+	struct inode *inode;
+	struct path mountpoint;
+	struct fd f = fdget(fs_fd);
+	unsigned int mnt_flags = 0;
+	long ret;
+
+	if (!f.file)
+		return -EBADF;
+
+	ret = -EINVAL;
+	if (f.file->f_op != &fs_fs_fops)
+		goto err_fsfd;
+
+	mc = f.file->private_data;
+
+	ret = -EPERM;
+	if (!may_mount() ||
+	    ((mc->ms_flags & MS_MANDLOCK) && !may_mandlock()))
+		goto err_fsfd;
+
+	/* Prevent further changes. */
+	inode = file_inode(f.file);
+	ret = inode_lock_killable(inode);
+	if (ret < 0)
+		goto err_fsfd;
+	ret = -EBUSY;
+	if (!mc->mounted) {
+		mc->mounted = true;
+		ret = 0;
+	}
+	inode_unlock(inode);
+	if (ret < 0)
+		goto err_fsfd;
+
+	/* Find the mountpoint.  A container can be specified in dfd. */
+	ret = user_path_at(dfd, dir_name, LOOKUP_FOLLOW | LOOKUP_AUTOMOUNT,
+			   &mountpoint);
+	if (ret < 0) {
+		mc->error = "VFS: Mountpoint lookup failed";
+		goto err_fsfd;
+	}
+
+	ret = security_mount_ctx_mountpoint(mc, &mountpoint);
+	if (ret < 0)
+		goto err_mp;
+
+	/* Default to relatime unless overridden */
+	if (!(mc->ms_flags & MS_NOATIME))
+		mnt_flags |= MNT_RELATIME;
+
+	/* Separate the per-mountpoint flags */
+	if (mc->ms_flags & MS_NOSUID)
+		mnt_flags |= MNT_NOSUID;
+	if (mc->ms_flags & MS_NODEV)
+		mnt_flags |= MNT_NODEV;
+	if (mc->ms_flags & MS_NOEXEC)
+		mnt_flags |= MNT_NOEXEC;
+	if (mc->ms_flags & MS_NOATIME)
+		mnt_flags |= MNT_NOATIME;
+	if (mc->ms_flags & MS_NODIRATIME)
+		mnt_flags |= MNT_NODIRATIME;
+	if (mc->ms_flags & MS_STRICTATIME)
+		mnt_flags &= ~(MNT_RELATIME | MNT_NOATIME);
+	if (mc->ms_flags & MS_RDONLY)
+		mnt_flags |= MNT_READONLY;
+	mc->mnt_flags = mnt_flags;
+
+	mc->ms_flags &= ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_ACTIVE | MS_BORN |
+			  MS_NOATIME | MS_NODIRATIME | MS_RELATIME | MS_KERNMOUNT |
+			  MS_STRICTATIME | MS_NOREMOTELOCK | MS_SUBMOUNT);
+
+	ret = do_new_mount_mc(mc, &mountpoint, mnt_flags);
+
+err_mp:
+	path_put(&mountpoint);
+err_fsfd:
+	fdput(f);
+	return ret;
+}
+
+/*
  * Return true if path is reachable from root
  *
  * namespace_sem or mount_lock is held
diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index f6aa68b8e68e..fe2bffd7264d 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -100,6 +100,10 @@
  *	Equivalent of sb_kern_mount, but with a mount_context.
  *	@mc indicates the mount context.
  *	@src_sb indicates the new superblock.
+ * @mount_ctx_mountpoint:
+ *	Equivalent of sb_mount, but with a mount_context.
+ *	@mc indicates the mount context.
+ *	@mountpoint indicates the path on which the mount will take place.
  *
  * Security hooks for filesystem operations.
  *
@@ -1389,6 +1393,7 @@ union security_list_options {
 	void (*mount_ctx_free)(struct mount_context *mc);
 	int (*mount_ctx_option)(struct mount_context *mc, char *opt);
 	int (*mount_ctx_kern_mount)(struct mount_context *mc, struct super_block *sb);
+	int (*mount_ctx_mountpoint)(struct mount_context *mc, struct path *mountpoint);
 
 	int (*sb_alloc_security)(struct super_block *sb);
 	void (*sb_free_security)(struct super_block *sb);
@@ -1703,6 +1708,7 @@ struct security_hook_heads {
 	struct list_head mount_ctx_free;
 	struct list_head mount_ctx_option;
 	struct list_head mount_ctx_kern_mount;
+	struct list_head mount_ctx_mountpoint;
 	struct list_head sb_alloc_security;
 	struct list_head sb_free_security;
 	struct list_head sb_copy_data;
diff --git a/include/linux/security.h b/include/linux/security.h
index 91efe3039bff..b427a554033a 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -226,6 +226,7 @@ int security_mount_ctx_dup(struct mount_context *mc, struct mount_context *src);
 void security_mount_ctx_free(struct mount_context *mc);
 int security_mount_ctx_option(struct mount_context *mc, char *opt);
 int security_mount_ctx_kern_mount(struct mount_context *mc, struct super_block *sb);
+int security_mount_ctx_mountpoint(struct mount_context *mc, struct path *mountpoint);
 int security_sb_alloc(struct super_block *sb);
 void security_sb_free(struct super_block *sb);
 int security_sb_copy_data(char *orig, char *copy);
@@ -541,6 +542,11 @@ static inline int security_mount_ctx_kern_mount(struct mount_context *mc,
 {
 	return 0;
 }
+static inline int security_mount_ctx_mountpoint(struct mount_context *mc,
+						struct path *mountpoint)
+{
+	return 0;
+}
 
 static inline int security_sb_alloc(struct super_block *sb)
 {
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 91ec8802ad5d..9ac7d8ca8c2e 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -906,5 +906,6 @@ asmlinkage long sys_pkey_free(int pkey);
 asmlinkage long sys_statx(int dfd, const char __user *path, unsigned flags,
 			  unsigned mask, struct statx __user *buffer);
 asmlinkage long sys_fsopen(const char *fs_name, int containerfd, unsigned int flags);
+asmlinkage long sys_fsmount(int fsfd, int dfd, const char *path);
 
 #endif
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index de1dc63e7e47..a0fe764bd5dd 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -261,3 +261,4 @@ cond_syscall(sys_pkey_free);
 
 /* fd-based mount */
 cond_syscall(sys_fsopen);
+cond_syscall(sys_fsmount);
diff --git a/security/security.c b/security/security.c
index 2e522361df66..56780c1852b5 100644
--- a/security/security.c
+++ b/security/security.c
@@ -334,6 +334,11 @@ int security_mount_ctx_kern_mount(struct mount_context *mc, struct super_block *
 	return call_int_hook(mount_ctx_kern_mount, 0, mc, sb);
 }
 
+int security_mount_ctx_mountpoint(struct mount_context *mc, struct path *mountpoint)
+{
+	return call_int_hook(mount_ctx_mountpoint, 0, mc, mountpoint);
+}
+
 int security_sb_alloc(struct super_block *sb)
 {
 	return call_int_hook(sb_alloc_security, 0, sb);
@@ -1691,6 +1696,8 @@ struct security_hook_heads security_hook_heads = {
 		LIST_HEAD_INIT(security_hook_heads.mount_ctx_option),
 	.mount_ctx_kern_mount =
 		LIST_HEAD_INIT(security_hook_heads.mount_ctx_kern_mount),
+	.mount_ctx_mountpoint =
+		LIST_HEAD_INIT(security_hook_heads.mount_ctx_mountpoint),
 	.sb_alloc_security =
 		LIST_HEAD_INIT(security_hook_heads.sb_alloc_security),
 	.sb_free_security =
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index cf38db840f71..2bd8e73eb9c9 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -2999,6 +2999,18 @@ static int selinux_mount_ctx_kern_mount(struct mount_context *mc,
 	return rc;
 }
 
+static int selinux_mount_ctx_mountpoint(struct mount_context *mc,
+					struct path *mountpoint)
+{
+	const struct cred *cred = current_cred();
+	int ret;
+
+	ret = path_has_perm(cred, mountpoint, FILE__MOUNTON);
+	if (ret < 0)
+		mc->error = "SELinux: Mount on mountpoint not permitted";
+	return ret;
+}
+
 /* inode security operations */
 
 static int selinux_inode_alloc_security(struct inode *inode)
@@ -6309,6 +6321,7 @@ static struct security_hook_list selinux_hooks[] = {
 	LSM_HOOK_INIT(mount_ctx_free, selinux_mount_ctx_free),
 	LSM_HOOK_INIT(mount_ctx_option, selinux_mount_ctx_option),
 	LSM_HOOK_INIT(mount_ctx_kern_mount, selinux_mount_ctx_kern_mount),
+	LSM_HOOK_INIT(mount_ctx_mountpoint, selinux_mount_ctx_mountpoint),
 
 	LSM_HOOK_INIT(sb_alloc_security, selinux_sb_alloc_security),
 	LSM_HOOK_INIT(sb_free_security, selinux_sb_free_security),

* [PATCH 6/9] Sample program for driving fsopen/fsmount
  2017-05-03 16:04 [RFC][PATCH 0/9] VFS: Introduce mount context David Howells
                   ` (4 preceding siblings ...)
  2017-05-03 16:05 ` [PATCH 5/9] Implement fsmount() to effect a pre-configured mount David Howells
@ 2017-05-03 16:05 ` David Howells
  2017-05-03 16:05 ` [PATCH 7/9] procfs: Move proc_fill_super() to fs/proc/root.c David Howells
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 66+ messages in thread
From: David Howells @ 2017-05-03 16:05 UTC (permalink / raw)
  To: viro; +Cc: linux-fsdevel, dhowells, linux-nfs, linux-kernel, mszeredi


---

 samples/fsmount/test-fsmount.c |   79 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 79 insertions(+)
 create mode 100644 samples/fsmount/test-fsmount.c

diff --git a/samples/fsmount/test-fsmount.c b/samples/fsmount/test-fsmount.c
new file mode 100644
index 000000000000..715617f86000
--- /dev/null
+++ b/samples/fsmount/test-fsmount.c
@@ -0,0 +1,79 @@
+/* fd-based mount test.
+ *
+ * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <sys/wait.h>
+
+#define E(x) do { if ((x) == -1) { perror(#x); exit(1); } } while(0)
+
+static __attribute__((noreturn))
+void mount_error(int fd, const char *s)
+{
+	char buf[4096];
+	int err, n;
+
+	err = errno;
+	n = read(fd, buf, sizeof(buf));
+	errno = err;
+	if (n > 0) {
+		n -= 2;
+		fprintf(stderr, "Error: '%s': %*.*s: %m\n", s, n, n, buf + 2);
+	} else {
+		fprintf(stderr, "%s: %m\n", s);
+	}
+	exit(1);
+}
+
+#define E_write(fd, s)							\
+	do {								\
+		if (write(fd, s, sizeof(s) - 1) == -1)			\
+			mount_error(fd, s);				\
+	} while (0)
+
+static inline int fsopen(const char *fs_name, int reserved, int flags)
+{
+	return syscall(333, fs_name, reserved, flags);
+}
+
+static inline int fsmount(int fsfd, int dfd, const char *path)
+{
+	return syscall(334, fsfd, dfd, path);
+}
+
+int main()
+{
+	int mfd;
+
+	/* Mount an NFS filesystem */
+	mfd = fsopen("nfs4", -1, 0);
+	if (mfd == -1) {
+		perror("fsopen");
+		exit(1);
+	}
+
+	E_write(mfd, "d warthog:/data");
+	E_write(mfd, "o fsc");
+	E_write(mfd, "o sync");
+	E_write(mfd, "o intr");
+	E_write(mfd, "o vers=4.2");
+	E_write(mfd, "o addr=90.155.74.18");
+	E_write(mfd, "o clientaddr=90.155.74.21");
+	//E_write(mfd, "r /data");
+	if (fsmount(mfd, AT_FDCWD, "/mnt") < 0)
+		mount_error(mfd, "fsmount");
+	E(close(mfd));
+
+	exit(0);
+}


* [PATCH 7/9] procfs: Move proc_fill_super() to fs/proc/root.c
  2017-05-03 16:04 [RFC][PATCH 0/9] VFS: Introduce mount context David Howells
                   ` (5 preceding siblings ...)
  2017-05-03 16:05 ` [PATCH 6/9] Sample program for driving fsopen/fsmount David Howells
@ 2017-05-03 16:05 ` David Howells
  2017-05-03 16:05 ` [PATCH 8/9] proc: Support the mount context in procfs David Howells
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 66+ messages in thread
From: David Howells @ 2017-05-03 16:05 UTC (permalink / raw)
  To: viro; +Cc: linux-fsdevel, dhowells, linux-nfs, linux-kernel, mszeredi

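Move proc_fill_super() to fs/proc/root.c, as that's where the rest of the
superblock handling lives.  proc_fill_super() and proc_parse_options() can
then become static, while proc_sops is made non-static and declared in
fs/proc/internal.h so that root.c can still refer to it.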

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/proc/inode.c    |   48 +-----------------------------------------------
 fs/proc/internal.h |    4 +---
 fs/proc/root.c     |   48 +++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 49 insertions(+), 51 deletions(-)

diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 2cc7a8030275..194fa2d13b7e 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -22,7 +22,6 @@
 #include <linux/seq_file.h>
 #include <linux/slab.h>
 #include <linux/mount.h>
-#include <linux/magic.h>
 
 #include <linux/uaccess.h>
 
@@ -113,7 +112,7 @@ static int proc_show_options(struct seq_file *seq, struct dentry *root)
 	return 0;
 }
 
-static const struct super_operations proc_sops = {
+const struct super_operations proc_sops = {
 	.alloc_inode	= proc_alloc_inode,
 	.destroy_inode	= proc_destroy_inode,
 	.drop_inode	= generic_delete_inode,
@@ -470,48 +469,3 @@ struct inode *proc_get_inode(struct super_block *sb, struct proc_dir_entry *de)
 	       pde_put(de);
 	return inode;
 }
-
-int proc_fill_super(struct super_block *s, void *data, int silent)
-{
-	struct pid_namespace *ns = get_pid_ns(s->s_fs_info);
-	struct inode *root_inode;
-	int ret;
-
-	if (!proc_parse_options(data, ns))
-		return -EINVAL;
-
-	/* User space would break if executables or devices appear on proc */
-	s->s_iflags |= SB_I_USERNS_VISIBLE | SB_I_NOEXEC | SB_I_NODEV;
-	s->s_flags |= MS_NODIRATIME | MS_NOSUID | MS_NOEXEC;
-	s->s_blocksize = 1024;
-	s->s_blocksize_bits = 10;
-	s->s_magic = PROC_SUPER_MAGIC;
-	s->s_op = &proc_sops;
-	s->s_time_gran = 1;
-
-	/*
-	 * procfs isn't actually a stacking filesystem; however, there is
-	 * too much magic going on inside it to permit stacking things on
-	 * top of it
-	 */
-	s->s_stack_depth = FILESYSTEM_MAX_STACK_DEPTH;
-	
-	pde_get(&proc_root);
-	root_inode = proc_get_inode(s, &proc_root);
-	if (!root_inode) {
-		pr_err("proc_fill_super: get root inode failed\n");
-		return -ENOMEM;
-	}
-
-	s->s_root = d_make_root(root_inode);
-	if (!s->s_root) {
-		pr_err("proc_fill_super: allocate dentry failed\n");
-		return -ENOMEM;
-	}
-
-	ret = proc_setup_self(s);
-	if (ret) {
-		return ret;
-	}
-	return proc_setup_thread_self(s);
-}
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index c5ae09b6c726..b681533f59dd 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -197,13 +197,12 @@ struct pde_opener {
 	struct completion *c;
 };
 extern const struct inode_operations proc_link_inode_operations;
-
 extern const struct inode_operations proc_pid_link_inode_operations;
+extern const struct super_operations proc_sops;
 
 extern void proc_init_inodecache(void);
 void set_proc_pid_nlink(void);
 extern struct inode *proc_get_inode(struct super_block *, struct proc_dir_entry *);
-extern int proc_fill_super(struct super_block *, void *data, int flags);
 extern void proc_entry_rundown(struct proc_dir_entry *);
 
 /*
@@ -261,7 +260,6 @@ static inline void proc_tty_init(void) {}
  * root.c
  */
 extern struct proc_dir_entry proc_root;
-extern int proc_parse_options(char *options, struct pid_namespace *pid);
 
 extern void proc_self_init(void);
 extern int proc_remount(struct super_block *, int *, char *);
diff --git a/fs/proc/root.c b/fs/proc/root.c
index deecb397daa3..ff2e810e9e64 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -22,6 +22,7 @@
 #include <linux/pid_namespace.h>
 #include <linux/parser.h>
 #include <linux/cred.h>
+#include <linux/magic.h>
 
 #include "internal.h"
 
@@ -35,7 +36,7 @@ static const match_table_t tokens = {
 	{Opt_err, NULL},
 };
 
-int proc_parse_options(char *options, struct pid_namespace *pid)
+static int proc_parse_options(char *options, struct pid_namespace *pid)
 {
 	char *p;
 	substring_t args[MAX_OPT_ARGS];
@@ -77,6 +78,51 @@ int proc_parse_options(char *options, struct pid_namespace *pid)
 	return 1;
 }
 
+static int proc_fill_super(struct super_block *s, void *data, int silent)
+{
+	struct pid_namespace *ns = get_pid_ns(s->s_fs_info);
+	struct inode *root_inode;
+	int ret;
+
+	if (!proc_parse_options(data, ns))
+		return -EINVAL;
+
+	/* User space would break if executables or devices appear on proc */
+	s->s_iflags |= SB_I_USERNS_VISIBLE | SB_I_NOEXEC | SB_I_NODEV;
+	s->s_flags |= MS_NODIRATIME | MS_NOSUID | MS_NOEXEC;
+	s->s_blocksize = 1024;
+	s->s_blocksize_bits = 10;
+	s->s_magic = PROC_SUPER_MAGIC;
+	s->s_op = &proc_sops;
+	s->s_time_gran = 1;
+
+	/*
+	 * procfs isn't actually a stacking filesystem; however, there is
+	 * too much magic going on inside it to permit stacking things on
+	 * top of it
+	 */
+	s->s_stack_depth = FILESYSTEM_MAX_STACK_DEPTH;
+	
+	pde_get(&proc_root);
+	root_inode = proc_get_inode(s, &proc_root);
+	if (!root_inode) {
+		pr_err("proc_fill_super: get root inode failed\n");
+		return -ENOMEM;
+	}
+
+	s->s_root = d_make_root(root_inode);
+	if (!s->s_root) {
+		pr_err("proc_fill_super: allocate dentry failed\n");
+		return -ENOMEM;
+	}
+
+	ret = proc_setup_self(s);
+	if (ret) {
+		return ret;
+	}
+	return proc_setup_thread_self(s);
+}
+
 int proc_remount(struct super_block *sb, int *flags, char *data)
 {
 	struct pid_namespace *pid = sb->s_fs_info;


* [PATCH 8/9] proc: Support the mount context in procfs
  2017-05-03 16:04 [RFC][PATCH 0/9] VFS: Introduce mount context David Howells
                   ` (6 preceding siblings ...)
  2017-05-03 16:05 ` [PATCH 7/9] procfs: Move proc_fill_super() to fs/proc/root.c David Howells
@ 2017-05-03 16:05 ` David Howells
  2017-05-03 16:05 ` [PATCH 9/9] NFS: Support the mount context and fsopen() David Howells
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 66+ messages in thread
From: David Howells @ 2017-05-03 16:05 UTC (permalink / raw)
  To: viro; +Cc: linux-fsdevel, dhowells, linux-nfs, linux-kernel, mszeredi


---

 fs/proc/inode.c    |    2 -
 fs/proc/internal.h |    2 -
 fs/proc/root.c     |  158 ++++++++++++++++++++++++++++++++--------------------
 3 files changed, 100 insertions(+), 62 deletions(-)

diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 194fa2d13b7e..9ddaf60c6f93 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -118,7 +118,7 @@ const struct super_operations proc_sops = {
 	.drop_inode	= generic_delete_inode,
 	.evict_inode	= proc_evict_inode,
 	.statfs		= simple_statfs,
-	.remount_fs	= proc_remount,
+	.remount_fs_mc	= proc_remount,
 	.show_options	= proc_show_options,
 };
 
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index b681533f59dd..68b693478e7e 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -262,7 +262,7 @@ static inline void proc_tty_init(void) {}
 extern struct proc_dir_entry proc_root;
 
 extern void proc_self_init(void);
-extern int proc_remount(struct super_block *, int *, char *);
+extern int proc_remount(struct super_block *, struct mount_context *);
 
 /*
  * task_[no]mmu.c
diff --git a/fs/proc/root.c b/fs/proc/root.c
index ff2e810e9e64..132529a5e896 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -23,9 +23,17 @@
 #include <linux/parser.h>
 #include <linux/cred.h>
 #include <linux/magic.h>
+#include <linux/slab.h>
 
 #include "internal.h"
 
+struct proc_mount_context {
+	struct mount_context	mc;
+	unsigned long		mask;
+	int			hidepid;
+	int			gid;
+};
+
 enum {
 	Opt_gid, Opt_hidepid, Opt_err,
 };
@@ -36,56 +44,68 @@ static const match_table_t tokens = {
 	{Opt_err, NULL},
 };
 
-static int proc_parse_options(char *options, struct pid_namespace *pid)
+static int proc_mount_option(struct mount_context *mc, char *p)
 {
-	char *p;
+	struct proc_mount_context *ctx =
+		container_of(mc, struct proc_mount_context, mc);
 	substring_t args[MAX_OPT_ARGS];
-	int option;
-
-	if (!options)
-		return 1;
-
-	while ((p = strsep(&options, ",")) != NULL) {
-		int token;
-		if (!*p)
-			continue;
-
-		args[0].to = args[0].from = NULL;
-		token = match_token(p, tokens, args);
-		switch (token) {
-		case Opt_gid:
-			if (match_int(&args[0], &option))
-				return 0;
-			pid->pid_gid = make_kgid(current_user_ns(), option);
-			break;
-		case Opt_hidepid:
-			if (match_int(&args[0], &option))
-				return 0;
-			if (option < HIDEPID_OFF ||
-			    option > HIDEPID_INVISIBLE) {
-				pr_err("proc: hidepid value must be between 0 and 2.\n");
-				return 0;
-			}
-			pid->hide_pid = option;
-			break;
-		default:
-			pr_err("proc: unrecognized mount option \"%s\" "
-			       "or missing value\n", p);
-			return 0;
+	int token;
+
+	args[0].to = args[0].from = NULL;
+	token = match_token(p, tokens, args);
+	switch (token) {
+	case Opt_gid:
+		if (match_int(&args[0], &ctx->gid)) {
+			mc->error = "procfs: Unparseable gid= argument";
+			return -EINVAL;
+		}
+		break;
+
+	case Opt_hidepid:
+		if (match_int(&args[0], &ctx->hidepid)) {
+			mc->error = "procfs: Unparseable hidepid= argument";
+			return -EINVAL;
 		}
+		if (ctx->hidepid < HIDEPID_OFF ||
+		    ctx->hidepid > HIDEPID_INVISIBLE) {
+			mc->error = "procfs: Invalid hidepid= argument";
+			pr_err("proc: hidepid value must be between 0 and 2.\n");
+			return -EINVAL;
+		}
+		break;
+
+	default:
+		pr_err("proc: unrecognized mount option \"%s\" "
+		       "or missing value\n", p);
+		mc->error = "procfs: Invalid mount option or missing value";
+		return -EINVAL;
 	}
 
-	return 1;
+	ctx->mask |= 1 << token;
+	return 0;
 }
 
-static int proc_fill_super(struct super_block *s, void *data, int silent)
+static void proc_set_options(struct super_block *s,
+			     struct mount_context *mc,
+			     struct pid_namespace *pid_ns,
+			     struct user_namespace *user_ns)
 {
-	struct pid_namespace *ns = get_pid_ns(s->s_fs_info);
+	struct proc_mount_context *ctx =
+		container_of(mc, struct proc_mount_context, mc);
+
+	if (ctx->mask & (1 << Opt_gid))
+		pid_ns->pid_gid = make_kgid(user_ns, ctx->gid);
+	if (ctx->mask & (1 << Opt_hidepid))
+		pid_ns->hide_pid = ctx->hidepid;
+}
+
+static int proc_fill_super(struct super_block *s, struct mount_context *mc)
+{
+	struct pid_namespace *pid_ns = get_pid_ns(s->s_fs_info);
 	struct inode *root_inode;
 	int ret;
 
-	if (!proc_parse_options(data, ns))
-		return -EINVAL;
+	proc_set_options(s, mc, pid_ns, current_user_ns());
 
 	/* User space would break if executables or devices appear on proc */
 	s->s_iflags |= SB_I_USERNS_VISIBLE | SB_I_NOEXEC | SB_I_NODEV;
@@ -102,7 +122,7 @@ static int proc_fill_super(struct super_block *s, void *data, int silent)
 	 * top of it
 	 */
 	s->s_stack_depth = FILESYSTEM_MAX_STACK_DEPTH;
-	
+
 	pde_get(&proc_root);
 	root_inode = proc_get_inode(s, &proc_root);
 	if (!root_inode) {
@@ -123,27 +143,32 @@ static int proc_fill_super(struct super_block *s, void *data, int silent)
 	return proc_setup_thread_self(s);
 }
 
-int proc_remount(struct super_block *sb, int *flags, char *data)
+int proc_remount(struct super_block *sb, struct mount_context *mc)
 {
 	struct pid_namespace *pid = sb->s_fs_info;
 
 	sync_filesystem(sb);
-	return !proc_parse_options(data, pid);
+
+	if (mc)
+		proc_set_options(sb, mc, pid, current_user_ns());
+	return 0;
 }
 
-static struct dentry *proc_mount(struct file_system_type *fs_type,
-	int flags, const char *dev_name, void *data)
+static struct dentry *proc_mount(struct mount_context *mc)
 {
-	struct pid_namespace *ns;
+	return mount_ns_mc(mc, mc->pid_ns);
+}
 
-	if (flags & MS_KERNMOUNT) {
-		ns = data;
-		data = NULL;
-	} else {
-		ns = task_active_pid_ns(current);
-	}
+static const struct mount_context_operations proc_mount_ctx_ops = {
+	.option		= proc_mount_option,
+	.mount		= proc_mount,
+	.fill_super	= proc_fill_super,
+};
 
-	return mount_ns(fs_type, flags, data, ns, ns->user_ns, proc_fill_super);
+static int proc_fsopen(struct mount_context *mc, struct super_block *src_sb)
+{
+	mc->ops = &proc_mount_ctx_ops;
+	return 0;
 }
 
 static void proc_kill_sb(struct super_block *sb)
@@ -161,7 +186,8 @@ static void proc_kill_sb(struct super_block *sb)
 
 static struct file_system_type proc_fs_type = {
 	.name		= "proc",
-	.mount		= proc_mount,
+	.fsopen		= proc_fsopen,
+	.mc_size	= sizeof(struct proc_mount_context),
 	.kill_sb	= proc_kill_sb,
 	.fs_flags	= FS_USERNS_MOUNT,
 };
@@ -209,7 +235,7 @@ static struct dentry *proc_root_lookup(struct inode * dir, struct dentry * dentr
 {
 	if (!proc_pid_lookup(dir, dentry, flags))
 		return NULL;
-	
+
 	return proc_lookup(dir, dentry, flags);
 }
 
@@ -248,12 +274,12 @@ static const struct inode_operations proc_root_inode_operations = {
  * This is the root "inode" in the /proc tree..
  */
 struct proc_dir_entry proc_root = {
-	.low_ino	= PROC_ROOT_INO, 
-	.namelen	= 5, 
-	.mode		= S_IFDIR | S_IRUGO | S_IXUGO, 
-	.nlink		= 2, 
+	.low_ino	= PROC_ROOT_INO,
+	.namelen	= 5,
+	.mode		= S_IFDIR | S_IRUGO | S_IXUGO,
+	.nlink		= 2,
 	.count		= ATOMIC_INIT(1),
-	.proc_iops	= &proc_root_inode_operations, 
+	.proc_iops	= &proc_root_inode_operations,
 	.proc_fops	= &proc_root_operations,
 	.parent		= &proc_root,
 	.subdir		= RB_ROOT,
@@ -262,9 +288,21 @@ struct proc_dir_entry proc_root = {
 
 int pid_ns_prepare_proc(struct pid_namespace *ns)
 {
+	struct mount_context *mc;
 	struct vfsmount *mnt;
 
-	mnt = kern_mount_data(&proc_fs_type, ns);
+	mc = __vfs_fsopen(&proc_fs_type, NULL, 0, 0, MOUNT_TYPE_NEW);
+	if (IS_ERR(mc))
+		return PTR_ERR(mc);
+
+	if (mc->pid_ns != ns) {
+		put_pid_ns(mc->pid_ns);
+		get_pid_ns(ns);
+		mc->pid_ns = ns;
+	}
+
+	mnt = kern_mount_data_mc(mc);
+	put_mount_context(mc);
 	if (IS_ERR(mnt))
 		return PTR_ERR(mnt);
 


* [PATCH 9/9] NFS: Support the mount context and fsopen()
  2017-05-03 16:04 [RFC][PATCH 0/9] VFS: Introduce mount context David Howells
                   ` (7 preceding siblings ...)
  2017-05-03 16:05 ` [PATCH 8/9] proc: Support the mount context in procfs David Howells
@ 2017-05-03 16:05 ` David Howells
  2017-05-03 16:44 ` [RFC][PATCH 0/9] VFS: Introduce mount context Jeff Layton
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 66+ messages in thread
From: David Howells @ 2017-05-03 16:05 UTC (permalink / raw)
  To: viro; +Cc: linux-fsdevel, dhowells, linux-nfs, linux-kernel, mszeredi

Support the fsopen() system call in NFS, parsing the options and filling in
an nfs_mount_context struct, which replaces nfs_parsed_mount_data and embeds
the mount_context struct.  For example:

	mfd = fsopen("nfs4", -1, 0);
	E_write(mfd, "d warthog:/root");
	E_write(mfd, "o fsc");
	E_write(mfd, "o sync");
	E_write(mfd, "o intr");
	E_write(mfd, "o foo");

Here E_write() is a wrapper around write() that determines the string
length and, if the write fails, read()s back the error message and prints
it.
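
As a rough illustration of the userspace string handling involved, the
sketch below builds the "<key> <value>" strings that the example writes and
picks the message text out of a buffer read back after a failure.
fs_format_command() and fs_error_text() are hypothetical helper names, not
part of this series.  Skipping two leading bytes simply mirrors what
mount_error() in the sample program does; the meaning of those bytes is an
assumption here, not something this description specifies.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/*
 * Build one command string of the form "<key> <value>", as the sample
 * program writes to the fsopen() fd (e.g. "o vers=4.2", "d warthog:/data").
 * Returns the string length excluding the NUL, or -1 if it does not fit.
 * Hypothetical helper for illustration only.
 */
static int fs_format_command(char *buf, size_t size, char key,
			     const char *value)
{
	int n;

	assert(buf != NULL && value != NULL);
	n = snprintf(buf, size, "%c %s", key, value);
	if (n < 0 || (size_t)n >= size)
		return -1;	/* output error or truncated */
	return n;
}

/*
 * Given n bytes read back from the fd after a failure, return a pointer to
 * the message text that follows the two leading bytes and store its length
 * in *len.  Returns NULL if the read was too short to hold any text.
 * The text is not NUL-terminated, so callers must honour *len.
 * Hypothetical helper for illustration only.
 */
static const char *fs_error_text(const char *buf, ssize_t n, size_t *len)
{
	if (n <= 2)
		return NULL;
	*len = (size_t)n - 2;
	return buf + 2;
}
```

A caller would pass the formatted buffer and the returned length to write()
on the fd obtained from fsopen(), and print the error text with a bounded
format such as "%.*s".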

A future patch will add the actual mounting part.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/nfs/Makefile         |    2 
 fs/nfs/client.c         |   18 
 fs/nfs/internal.h       |  127 ++-
 fs/nfs/mount.c          | 1539 ++++++++++++++++++++++++++++++++++++++++++
 fs/nfs/namespace.c      |   75 +-
 fs/nfs/nfs3_fs.h        |    2 
 fs/nfs/nfs3client.c     |    6 
 fs/nfs/nfs3proc.c       |    1 
 fs/nfs/nfs4_fs.h        |    4 
 fs/nfs/nfs4client.c     |   80 +-
 fs/nfs/nfs4namespace.c  |  207 ++++--
 fs/nfs/nfs4proc.c       |    1 
 fs/nfs/nfs4super.c      |  184 ++---
 fs/nfs/proc.c           |    1 
 fs/nfs/super.c          | 1729 +++--------------------------------------------
 include/linux/nfs_xdr.h |    7 
 16 files changed, 2035 insertions(+), 1948 deletions(-)
 create mode 100644 fs/nfs/mount.c

diff --git a/fs/nfs/Makefile b/fs/nfs/Makefile
index 6abdda209642..655fd5e72e60 100644
--- a/fs/nfs/Makefile
+++ b/fs/nfs/Makefile
@@ -7,7 +7,7 @@ obj-$(CONFIG_NFS_FS) += nfs.o
 CFLAGS_nfstrace.o += -I$(src)
 nfs-y 			:= client.o dir.o file.o getroot.o inode.o super.o \
 			   io.o direct.o pagelist.o read.o symlink.o unlink.o \
-			   write.o namespace.o mount_clnt.o nfstrace.o
+			   write.o namespace.o mount_clnt.o nfstrace.o mount.o
 nfs-$(CONFIG_ROOT_NFS)	+= nfsroot.o
 nfs-$(CONFIG_SYSCTL)	+= sysctl.o
 nfs-$(CONFIG_NFS_FSCACHE) += fscache.o fscache-index.o
diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index 04d15a0045e3..a40dafacdc2d 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -652,17 +652,16 @@ EXPORT_SYMBOL_GPL(nfs_init_client);
  * Create a version 2 or 3 client
  */
 static int nfs_init_server(struct nfs_server *server,
-			   const struct nfs_parsed_mount_data *data,
-			   struct nfs_subversion *nfs_mod)
+			   const struct nfs_mount_context *data)
 {
 	struct rpc_timeout timeparms;
 	struct nfs_client_initdata cl_init = {
 		.hostname = data->nfs_server.hostname,
 		.addr = (const struct sockaddr *)&data->nfs_server.address,
 		.addrlen = data->nfs_server.addrlen,
-		.nfs_mod = nfs_mod,
+		.nfs_mod = data->nfs_mod,
 		.proto = data->nfs_server.protocol,
-		.net = data->net,
+		.net = data->mc.net_ns,
 		.timeparms = &timeparms,
 	};
 	struct nfs_client *clp;
@@ -954,8 +953,7 @@ EXPORT_SYMBOL_GPL(nfs_free_server);
  * Create a version 2 or 3 volume record
  * - keyed on server and FSID
  */
-struct nfs_server *nfs_create_server(struct nfs_mount_info *mount_info,
-				     struct nfs_subversion *nfs_mod)
+struct nfs_server *nfs_create_server(struct nfs_mount_context *ctx)
 {
 	struct nfs_server *server;
 	struct nfs_fattr *fattr;
@@ -971,18 +969,18 @@ struct nfs_server *nfs_create_server(struct nfs_mount_info *mount_info,
 		goto error;
 
 	/* Get a client representation */
-	error = nfs_init_server(server, mount_info->parsed, nfs_mod);
+	error = nfs_init_server(server, ctx);
 	if (error < 0)
 		goto error;
 
 	/* Probe the root fh to retrieve its FSID */
-	error = nfs_probe_fsinfo(server, mount_info->mntfh, fattr);
+	error = nfs_probe_fsinfo(server, ctx->mntfh, fattr);
 	if (error < 0)
 		goto error;
 	if (server->nfs_client->rpc_ops->version == 3) {
 		if (server->namelen == 0 || server->namelen > NFS3_MAXNAMLEN)
 			server->namelen = NFS3_MAXNAMLEN;
-		if (!(mount_info->parsed->flags & NFS_MOUNT_NORDIRPLUS))
+		if (!(ctx->flags & NFS_MOUNT_NORDIRPLUS))
 			server->caps |= NFS_CAP_READDIRPLUS;
 	} else {
 		if (server->namelen == 0 || server->namelen > NFS2_MAXNAMLEN)
@@ -990,7 +988,7 @@ struct nfs_server *nfs_create_server(struct nfs_mount_info *mount_info,
 	}
 
 	if (!(fattr->valid & NFS_ATTR_FATTR)) {
-		error = nfs_mod->rpc_ops->getattr(server, mount_info->mntfh, fattr, NULL);
+		error = ctx->nfs_mod->rpc_ops->getattr(server, ctx->mntfh, fattr, NULL);
 		if (error < 0) {
 			dprintk("nfs_create_server: getattr error = %d\n", -error);
 			goto error;
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index 9dc65d7ae754..8575ed77ecd9 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -6,6 +6,7 @@
 #include <linux/mount.h>
 #include <linux/security.h>
 #include <linux/crc32.h>
+#include <linux/sunrpc/addr.h>
 #include <linux/nfs_page.h>
 
 #define NFS_MS_MASK (MS_RDONLY|MS_NOSUID|MS_NODEV|MS_NOEXEC|MS_SYNCHRONOUS)
@@ -35,18 +36,6 @@ static inline int nfs_attr_use_mounted_on_fileid(struct nfs_fattr *fattr)
 	return 1;
 }
 
-struct nfs_clone_mount {
-	const struct super_block *sb;
-	const struct dentry *dentry;
-	struct nfs_fh *fh;
-	struct nfs_fattr *fattr;
-	char *hostname;
-	char *mnt_path;
-	struct sockaddr *addr;
-	size_t addrlen;
-	rpc_authflavor_t authflavor;
-};
-
 /*
  * Note: RFC 1813 doesn't limit the number of auth flavors that
  * a server can return, so make something up.
@@ -81,15 +70,27 @@ struct nfs_client_initdata {
 	const struct rpc_timeout *timeparms;
 };
 
+enum nfs_mount_type {
+	NFS_MOUNT_ORDINARY,
+	NFS_MOUNT_CROSS_DEV,
+	NFS4_MOUNT_REMOTE,
+	NFS4_MOUNT_REFERRAL,
+	NFS4_MOUNT_REMOTE_REFERRAL,
+};
+
 /*
  * In-kernel mount arguments
  */
-struct nfs_parsed_mount_data {
-	int			flags;
+struct nfs_mount_context {
+	struct mount_context	mc;
+	enum nfs_mount_type	mount_type : 8;
+	bool			skip_remount_option_check;
+	bool			need_mount;
+	unsigned int		flags;		/* NFS{,4}_MOUNT_* flags */
 	unsigned int		rsize, wsize;
 	unsigned int		timeo, retrans;
-	unsigned int		acregmin, acregmax,
-				acdirmin, acdirmax;
+	unsigned int		acregmin, acregmax;
+	unsigned int		acdirmin, acdirmax;
 	unsigned int		namlen;
 	unsigned int		options;
 	unsigned int		bsize;
@@ -99,10 +100,14 @@ struct nfs_parsed_mount_data {
 	unsigned int		version;
 	unsigned int		minorversion;
 	char			*fscache_uniq;
-	bool			need_mount;
+	unsigned short		protofamily;
+	unsigned short		mountfamily;
 
 	struct {
-		struct sockaddr_storage	address;
+		union {
+			struct sockaddr	address;
+			struct sockaddr_storage	_address;
+		};
 		size_t			addrlen;
 		char			*hostname;
 		u32			version;
@@ -111,16 +116,36 @@ struct nfs_parsed_mount_data {
 	} mount_server;
 
 	struct {
-		struct sockaddr_storage	address;
+		union {
+			struct sockaddr	address;
+			struct sockaddr_storage	_address;
+		};
 		size_t			addrlen;
 		char			*hostname;
 		char			*export_path;
 		int			port;
 		unsigned short		protocol;
+		unsigned short		export_path_len;
 	} nfs_server;
 
-	struct security_mnt_opts lsm_opts;
-	struct net		*net;
+	struct nfs_fh		*mntfh;
+	struct nfs_subversion	*nfs_mod;
+
+	int (*set_security)(struct super_block *, struct dentry *,
+			    struct nfs_mount_context *);
+
+	/* Information for a cloned mount.  Possibly this should be either
+	 * unioned with the information above or should be a separate
+	 * mount_context derivative.
+	 */
+	struct nfs_clone_mount {
+		struct super_block	*sb;
+		struct dentry		*dentry;
+		struct nfs_fattr	*fattr;
+		bool			cloned;
+	} clone_data;
+
+	char			buf[32];	/* Parse buffer */
 };
 
 /* mount_clnt.c */
@@ -138,14 +163,6 @@ struct nfs_mount_request {
 	struct net		*net;
 };
 
-struct nfs_mount_info {
-	int (*fill_super)(struct super_block *, struct nfs_mount_info *);
-	int (*set_security)(struct super_block *, struct dentry *, struct nfs_mount_info *);
-	struct nfs_parsed_mount_data *parsed;
-	struct nfs_clone_mount *cloned;
-	struct nfs_fh *mntfh;
-};
-
 extern int nfs_mount(struct nfs_mount_request *info);
 extern void nfs_umount(const struct nfs_mount_request *info);
 
@@ -171,13 +188,9 @@ extern struct nfs_client *nfs4_find_client_ident(struct net *, int);
 extern struct nfs_client *
 nfs4_find_client_sessionid(struct net *, const struct sockaddr *,
 				struct nfs4_sessionid *, u32);
-extern struct nfs_server *nfs_create_server(struct nfs_mount_info *,
-					struct nfs_subversion *);
-extern struct nfs_server *nfs4_create_server(
-					struct nfs_mount_info *,
-					struct nfs_subversion *);
-extern struct nfs_server *nfs4_create_referral_server(struct nfs_clone_mount *,
-						      struct nfs_fh *);
+extern struct nfs_server *nfs_create_server(struct nfs_mount_context *);
+extern struct nfs_server *nfs4_create_server(struct nfs_mount_context *);
+extern struct nfs_server *nfs4_create_referral_server(struct nfs_mount_context *);
 extern int nfs4_update_server(struct nfs_server *server, const char *hostname,
 					struct sockaddr *sap, size_t salen,
 					struct net *net);
@@ -229,6 +242,11 @@ extern struct svc_version nfs4_callback_version1;
 extern struct svc_version nfs4_callback_version4;
 
 struct nfs_pageio_descriptor;
+
+/* mount.c */
+extern const char nfs_slash[];
+extern struct dentry *nfs_general_mount(struct nfs_mount_context *ctx);
+
 /* pagelist.c */
 extern int __init nfs_init_nfspagecache(void);
 extern void nfs_destroy_nfspagecache(void);
@@ -390,24 +408,13 @@ extern int nfs_wait_atomic_killable(atomic_t *p);
 /* super.c */
 extern const struct super_operations nfs_sops;
 extern struct file_system_type nfs_fs_type;
-extern struct file_system_type nfs_xdev_fs_type;
-#if IS_ENABLED(CONFIG_NFS_V4)
-extern struct file_system_type nfs4_xdev_fs_type;
-extern struct file_system_type nfs4_referral_fs_type;
-#endif
 bool nfs_auth_info_match(const struct nfs_auth_info *, rpc_authflavor_t);
-struct dentry *nfs_try_mount(int, const char *, struct nfs_mount_info *,
-			struct nfs_subversion *);
-void nfs_initialise_sb(struct super_block *);
-int nfs_set_sb_security(struct super_block *, struct dentry *, struct nfs_mount_info *);
-int nfs_clone_sb_security(struct super_block *, struct dentry *, struct nfs_mount_info *);
-struct dentry *nfs_fs_mount_common(struct nfs_server *, int, const char *,
-				   struct nfs_mount_info *, struct nfs_subversion *);
-struct dentry *nfs_fs_mount(struct file_system_type *, int, const char *, void *);
-struct dentry * nfs_xdev_mount_common(struct file_system_type *, int,
-		const char *, struct nfs_mount_info *);
+struct dentry *nfs_try_mount(struct nfs_mount_context *);
+int nfs_set_sb_security(struct super_block *, struct dentry *, struct nfs_mount_context *);
+int nfs_clone_sb_security(struct super_block *, struct dentry *, struct nfs_mount_context *);
+struct dentry *nfs_fs_mount_common(struct nfs_server *, struct nfs_mount_context *);
 void nfs_kill_super(struct super_block *);
-int nfs_fill_super(struct super_block *, struct nfs_mount_info *);
+int nfs_fill_super(struct super_block *, struct mount_context *);
 
 extern struct rpc_stat nfs_rpcstat;
 
@@ -458,14 +465,13 @@ extern void nfs_read_prepare(struct rpc_task *task, void *calldata);
 extern void nfs_pageio_reset_read_mds(struct nfs_pageio_descriptor *pgio);
 
 /* super.c */
-int nfs_clone_super(struct super_block *, struct nfs_mount_info *);
 void nfs_umount_begin(struct super_block *);
 int  nfs_statfs(struct dentry *, struct kstatfs *);
 int  nfs_show_options(struct seq_file *, struct dentry *);
 int  nfs_show_devname(struct seq_file *, struct dentry *);
 int  nfs_show_path(struct seq_file *, struct dentry *);
 int  nfs_show_stats(struct seq_file *, struct dentry *);
-int nfs_remount(struct super_block *sb, int *flags, char *raw_data);
+int nfs_remount(struct super_block *sb, struct mount_context *mc);
 
 /* write.c */
 extern void nfs_pageio_init_write(struct nfs_pageio_descriptor *pgio,
@@ -765,3 +771,16 @@ static inline bool nfs_error_is_fatal(int err)
 		return false;
 	}
 }
+
+/*
+ * Select between a default port value and a user-specified port value.
+ * If a zero value is set, then autobind will be used.
+ */
+static inline void nfs_set_port(struct sockaddr *sap, int *port,
+				const unsigned short default_port)
+{
+	if (*port == NFS_UNSPEC_PORT)
+		*port = default_port;
+
+	rpc_set_port(sap, *port);
+}
diff --git a/fs/nfs/mount.c b/fs/nfs/mount.c
new file mode 100644
index 000000000000..4716bb0d0fd2
--- /dev/null
+++ b/fs/nfs/mount.c
@@ -0,0 +1,1539 @@
+/* NFS mount handling.
+ *
+ * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * Split from fs/nfs/super.c:
+ *
+ *  Copyright (C) 1992  Rick Sladkey
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#include <linux/module.h>
+#include <linux/fs.h>
+#include <linux/parser.h>
+#include <linux/nfs_fs.h>
+#include <linux/nfs_mount.h>
+#include <linux/nfs4_mount.h>
+#include "nfs.h"
+#include "internal.h"
+
+#define NFSDBG_FACILITY		NFSDBG_VFS
+
+#if IS_ENABLED(CONFIG_NFS_V3)
+#define NFS_DEFAULT_VERSION 3
+#else
+#define NFS_DEFAULT_VERSION 2
+#endif
+
+enum {
+	/* Mount options that take no arguments */
+	Opt_soft, Opt_hard,
+	Opt_posix, Opt_noposix,
+	Opt_cto, Opt_nocto,
+	Opt_ac, Opt_noac,
+	Opt_lock, Opt_nolock,
+	Opt_udp, Opt_tcp, Opt_rdma,
+	Opt_acl, Opt_noacl,
+	Opt_rdirplus, Opt_nordirplus,
+	Opt_sharecache, Opt_nosharecache,
+	Opt_resvport, Opt_noresvport,
+	Opt_fscache, Opt_nofscache,
+	Opt_migration, Opt_nomigration,
+
+	/* Mount options that take integer arguments */
+	Opt_port,
+	Opt_rsize, Opt_wsize, Opt_bsize,
+	Opt_timeo, Opt_retrans,
+	Opt_acregmin, Opt_acregmax,
+	Opt_acdirmin, Opt_acdirmax,
+	Opt_actimeo,
+	Opt_namelen,
+	Opt_mountport,
+	Opt_mountvers,
+	Opt_minorversion,
+
+	/* Mount options that take string arguments */
+	Opt_nfsvers,
+	Opt_sec, Opt_proto, Opt_mountproto, Opt_mounthost,
+	Opt_addr, Opt_mountaddr, Opt_clientaddr,
+	Opt_lookupcache,
+	Opt_fscache_uniq,
+	Opt_local_lock,
+
+	/* Special mount options */
+	Opt_userspace, Opt_deprecated, Opt_sloppy,
+
+	Opt_err
+};
+
+static const match_table_t nfs_mount_option_tokens = {
+	{ Opt_userspace, "bg" },
+	{ Opt_userspace, "fg" },
+	{ Opt_userspace, "retry=%s" },
+
+	{ Opt_sloppy, "sloppy" },
+
+	{ Opt_soft, "soft" },
+	{ Opt_hard, "hard" },
+	{ Opt_deprecated, "intr" },
+	{ Opt_deprecated, "nointr" },
+	{ Opt_posix, "posix" },
+	{ Opt_noposix, "noposix" },
+	{ Opt_cto, "cto" },
+	{ Opt_nocto, "nocto" },
+	{ Opt_ac, "ac" },
+	{ Opt_noac, "noac" },
+	{ Opt_lock, "lock" },
+	{ Opt_nolock, "nolock" },
+	{ Opt_udp, "udp" },
+	{ Opt_tcp, "tcp" },
+	{ Opt_rdma, "rdma" },
+	{ Opt_acl, "acl" },
+	{ Opt_noacl, "noacl" },
+	{ Opt_rdirplus, "rdirplus" },
+	{ Opt_nordirplus, "nordirplus" },
+	{ Opt_sharecache, "sharecache" },
+	{ Opt_nosharecache, "nosharecache" },
+	{ Opt_resvport, "resvport" },
+	{ Opt_noresvport, "noresvport" },
+	{ Opt_fscache, "fsc" },
+	{ Opt_nofscache, "nofsc" },
+	{ Opt_migration, "migration" },
+	{ Opt_nomigration, "nomigration" },
+
+	{ Opt_port, "port=%s" },
+	{ Opt_rsize, "rsize=%s" },
+	{ Opt_wsize, "wsize=%s" },
+	{ Opt_bsize, "bsize=%s" },
+	{ Opt_timeo, "timeo=%s" },
+	{ Opt_retrans, "retrans=%s" },
+	{ Opt_acregmin, "acregmin=%s" },
+	{ Opt_acregmax, "acregmax=%s" },
+	{ Opt_acdirmin, "acdirmin=%s" },
+	{ Opt_acdirmax, "acdirmax=%s" },
+	{ Opt_actimeo, "actimeo=%s" },
+	{ Opt_namelen, "namlen=%s" },
+	{ Opt_mountport, "mountport=%s" },
+	{ Opt_mountvers, "mountvers=%s" },
+	{ Opt_minorversion, "minorversion=%s" },
+
+	{ Opt_nfsvers, "nfsvers=%s" },
+	{ Opt_nfsvers, "vers=%s" },
+
+	{ Opt_sec, "sec=%s" },
+	{ Opt_proto, "proto=%s" },
+	{ Opt_mountproto, "mountproto=%s" },
+	{ Opt_addr, "addr=%s" },
+	{ Opt_clientaddr, "clientaddr=%s" },
+	{ Opt_mounthost, "mounthost=%s" },
+	{ Opt_mountaddr, "mountaddr=%s" },
+
+	{ Opt_lookupcache, "lookupcache=%s" },
+	{ Opt_fscache_uniq, "fsc=%s" },
+	{ Opt_local_lock, "local_lock=%s" },
+
+	/* The following needs to be listed after all other options */
+	{ Opt_nfsvers, "v%s" },
+
+	{ Opt_err, NULL }
+};
+
+enum {
+	Opt_xprt_udp, Opt_xprt_udp6, Opt_xprt_tcp, Opt_xprt_tcp6, Opt_xprt_rdma,
+	Opt_xprt_rdma6,
+
+	Opt_xprt_err
+};
+
+static const match_table_t nfs_xprt_protocol_tokens = {
+	{ Opt_xprt_udp, "udp" },
+	{ Opt_xprt_udp6, "udp6" },
+	{ Opt_xprt_tcp, "tcp" },
+	{ Opt_xprt_tcp6, "tcp6" },
+	{ Opt_xprt_rdma, "rdma" },
+	{ Opt_xprt_rdma6, "rdma6" },
+
+	{ Opt_xprt_err, NULL }
+};
+
+enum {
+	Opt_sec_none, Opt_sec_sys,
+	Opt_sec_krb5, Opt_sec_krb5i, Opt_sec_krb5p,
+	Opt_sec_lkey, Opt_sec_lkeyi, Opt_sec_lkeyp,
+	Opt_sec_spkm, Opt_sec_spkmi, Opt_sec_spkmp,
+
+	Opt_sec_err
+};
+
+static const match_table_t nfs_secflavor_tokens = {
+	{ Opt_sec_none, "none" },
+	{ Opt_sec_none, "null" },
+	{ Opt_sec_sys, "sys" },
+
+	{ Opt_sec_krb5, "krb5" },
+	{ Opt_sec_krb5i, "krb5i" },
+	{ Opt_sec_krb5p, "krb5p" },
+
+	{ Opt_sec_lkey, "lkey" },
+	{ Opt_sec_lkeyi, "lkeyi" },
+	{ Opt_sec_lkeyp, "lkeyp" },
+
+	{ Opt_sec_spkm, "spkm3" },
+	{ Opt_sec_spkmi, "spkm3i" },
+	{ Opt_sec_spkmp, "spkm3p" },
+
+	{ Opt_sec_err, NULL }
+};
+
+enum {
+	Opt_lookupcache_all, Opt_lookupcache_positive,
+	Opt_lookupcache_none,
+
+	Opt_lookupcache_err
+};
+
+static const match_table_t nfs_lookupcache_tokens = {
+	{ Opt_lookupcache_all, "all" },
+	{ Opt_lookupcache_positive, "pos" },
+	{ Opt_lookupcache_positive, "positive" },
+	{ Opt_lookupcache_none, "none" },
+
+	{ Opt_lookupcache_err, NULL }
+};
+
+enum {
+	Opt_local_lock_all, Opt_local_lock_flock, Opt_local_lock_posix,
+	Opt_local_lock_none,
+
+	Opt_local_lock_err
+};
+
+static const match_table_t nfs_local_lock_tokens = {
+	{ Opt_local_lock_all, "all" },
+	{ Opt_local_lock_flock, "flock" },
+	{ Opt_local_lock_posix, "posix" },
+	{ Opt_local_lock_none, "none" },
+
+	{ Opt_local_lock_err, NULL }
+};
+
+enum {
+	Opt_vers_2, Opt_vers_3, Opt_vers_4, Opt_vers_4_0,
+	Opt_vers_4_1, Opt_vers_4_2,
+
+	Opt_vers_err
+};
+
+static const match_table_t nfs_vers_tokens = {
+	{ Opt_vers_2, "2" },
+	{ Opt_vers_3, "3" },
+	{ Opt_vers_4, "4" },
+	{ Opt_vers_4_0, "4.0" },
+	{ Opt_vers_4_1, "4.1" },
+	{ Opt_vers_4_2, "4.2" },
+
+	{ Opt_vers_err, NULL }
+};
+
+const char nfs_slash[] = "/";
+EXPORT_SYMBOL_GPL(nfs_slash);
+
+/*
+ * Sanity-check a server address provided by the mount command.
+ *
+ * Address family must be initialized, and address must not be
+ * the ANY address for that family.
+ */
+static int nfs_verify_server_address(struct sockaddr *addr)
+{
+	switch (addr->sa_family) {
+	case AF_INET: {
+		struct sockaddr_in *sa = (struct sockaddr_in *)addr;
+		return sa->sin_addr.s_addr != htonl(INADDR_ANY);
+	}
+	case AF_INET6: {
+		struct in6_addr *sa = &((struct sockaddr_in6 *)addr)->sin6_addr;
+		return !ipv6_addr_any(sa);
+	}
+	}
+
+	dfprintk(MOUNT, "NFS: Invalid IP address specified\n");
+	return 0;
+}
+
+/*
+ * Sanity-check the NFS transport protocol, falling back to TCP if the
+ * requested transport is not recognised.
+ */
+static void nfs_validate_transport_protocol(struct nfs_mount_context *mnt)
+{
+	switch (mnt->nfs_server.protocol) {
+	case XPRT_TRANSPORT_UDP:
+	case XPRT_TRANSPORT_TCP:
+	case XPRT_TRANSPORT_RDMA:
+		break;
+	default:
+		mnt->nfs_server.protocol = XPRT_TRANSPORT_TCP;
+	}
+}
+
+/*
+ * For text based NFSv2/v3 mounts, the mount protocol transport default
+ * settings should depend upon the specified NFS transport.
+ */
+static void nfs_set_mount_transport_protocol(struct nfs_mount_context *mnt)
+{
+	nfs_validate_transport_protocol(mnt);
+
+	if (mnt->mount_server.protocol == XPRT_TRANSPORT_UDP ||
+	    mnt->mount_server.protocol == XPRT_TRANSPORT_TCP)
+		return;
+	switch (mnt->nfs_server.protocol) {
+	case XPRT_TRANSPORT_UDP:
+		mnt->mount_server.protocol = XPRT_TRANSPORT_UDP;
+		break;
+	case XPRT_TRANSPORT_TCP:
+	case XPRT_TRANSPORT_RDMA:
+		mnt->mount_server.protocol = XPRT_TRANSPORT_TCP;
+	}
+}
+
+/*
+ * Add 'flavor' to 'auth_info' if not already present.
+ * Returns true if 'flavor' ends up in the list, false otherwise
+ */
+static bool nfs_auth_info_add(struct nfs_auth_info *auth_info,
+			      rpc_authflavor_t flavor)
+{
+	unsigned int i;
+	unsigned int max_flavor_len = ARRAY_SIZE(auth_info->flavors);
+
+	/* make sure this flavor isn't already in the list */
+	for (i = 0; i < auth_info->flavor_len; i++) {
+		if (flavor == auth_info->flavors[i])
+			return true;
+	}
+
+	if (auth_info->flavor_len + 1 >= max_flavor_len) {
+		dfprintk(MOUNT, "NFS: too many sec= flavors\n");
+		return false;
+	}
+
+	auth_info->flavors[auth_info->flavor_len++] = flavor;
+	return true;
+}
+
+/*
+ * Parse the value of the 'sec=' option.
+ */
+static int nfs_parse_security_flavors(char *value,
+				      struct nfs_mount_context *mnt)
+{
+	substring_t args[MAX_OPT_ARGS];
+	rpc_authflavor_t pseudoflavor;
+	char *p;
+
+	dfprintk(MOUNT, "NFS: parsing sec=%s option\n", value);
+
+	while ((p = strsep(&value, ":")) != NULL) {
+		switch (match_token(p, nfs_secflavor_tokens, args)) {
+		case Opt_sec_none:
+			pseudoflavor = RPC_AUTH_NULL;
+			break;
+		case Opt_sec_sys:
+			pseudoflavor = RPC_AUTH_UNIX;
+			break;
+		case Opt_sec_krb5:
+			pseudoflavor = RPC_AUTH_GSS_KRB5;
+			break;
+		case Opt_sec_krb5i:
+			pseudoflavor = RPC_AUTH_GSS_KRB5I;
+			break;
+		case Opt_sec_krb5p:
+			pseudoflavor = RPC_AUTH_GSS_KRB5P;
+			break;
+		case Opt_sec_lkey:
+			pseudoflavor = RPC_AUTH_GSS_LKEY;
+			break;
+		case Opt_sec_lkeyi:
+			pseudoflavor = RPC_AUTH_GSS_LKEYI;
+			break;
+		case Opt_sec_lkeyp:
+			pseudoflavor = RPC_AUTH_GSS_LKEYP;
+			break;
+		case Opt_sec_spkm:
+			pseudoflavor = RPC_AUTH_GSS_SPKM;
+			break;
+		case Opt_sec_spkmi:
+			pseudoflavor = RPC_AUTH_GSS_SPKMI;
+			break;
+		case Opt_sec_spkmp:
+			pseudoflavor = RPC_AUTH_GSS_SPKMP;
+			break;
+		default:
+			dfprintk(MOUNT,
+				 "NFS: sec= option '%s' not recognized\n", p);
+			return 0;
+		}
+
+		if (!nfs_auth_info_add(&mnt->auth_info, pseudoflavor))
+			return 0;
+	}
+
+	return 1;
+}
+
+static int nfs_parse_version_string(char *string,
+				    struct nfs_mount_context *mnt,
+				    substring_t *args)
+{
+	mnt->flags &= ~NFS_MOUNT_VER3;
+	switch (match_token(string, nfs_vers_tokens, args)) {
+	case Opt_vers_2:
+		mnt->version = 2;
+		break;
+	case Opt_vers_3:
+		mnt->flags |= NFS_MOUNT_VER3;
+		mnt->version = 3;
+		break;
+	case Opt_vers_4:
+		/* Backward compatibility option. In future,
+		 * the mount program should always supply
+		 * a NFSv4 minor version number.
+		 */
+		mnt->version = 4;
+		break;
+	case Opt_vers_4_0:
+		mnt->version = 4;
+		mnt->minorversion = 0;
+		break;
+	case Opt_vers_4_1:
+		mnt->version = 4;
+		mnt->minorversion = 1;
+		break;
+	case Opt_vers_4_2:
+		mnt->version = 4;
+		mnt->minorversion = 2;
+		break;
+	default:
+		return 0;
+	}
+	return 1;
+}
+
+static int nfs_get_option_str(substring_t args[], char **option)
+{
+	kfree(*option);
+	*option = match_strdup(args);
+	return !*option;
+}
+
+static int nfs_get_option_ui(struct nfs_mount_context *mc,
+			     substring_t args[], unsigned int *option)
+{
+	match_strlcpy(mc->buf, args, sizeof(mc->buf));
+	return kstrtouint(mc->buf, 10, option);
+}
+
+static int nfs_get_option_ui_bound(struct nfs_mount_context *mc,
+				   substring_t args[], unsigned int *option,
+				   unsigned int l_bound, unsigned int u_bound)
+{
+	int ret;
+
+	match_strlcpy(mc->buf, args, sizeof(mc->buf));
+	ret = kstrtouint(mc->buf, 10, option);
+	if (ret < 0)
+		return ret;
+	if (*option < l_bound || *option > u_bound)
+		return -ERANGE;
+	return 0;
+}
+
+/*
+ * Split mc->device into "hostname:export_path".
+ *
+ * The leftmost colon demarks the split between the server's hostname
+ * and the export path.  If the hostname starts with a left square
+ * bracket, then it may contain colons.
+ *
+ * Note: caller frees hostname and export path, even on error.
+ */
+static int nfs_parse_devname(struct nfs_mount_context *ctx,
+			     size_t maxnamlen, size_t maxpathlen)
+{
+	char *dev_name = ctx->mc.device;
+	size_t len;
+	char *end;
+
+	/* Is the host name protected with square brackets? */
+	if (*dev_name == '[') {
+		end = strchr(++dev_name, ']');
+		if (end == NULL || end[1] != ':')
+			goto out_bad_devname;
+
+		len = end - dev_name;
+		end++;
+	} else {
+		char *comma;
+
+		end = strchr(dev_name, ':');
+		if (end == NULL)
+			goto out_bad_devname;
+		len = end - dev_name;
+
+		/* kill possible hostname list: not supported */
+		comma = strchr(dev_name, ',');
+		if (comma != NULL && comma < end)
+			*comma = 0;
+	}
+
+	if (len > maxnamlen)
+		goto out_hostname;
+
+	/* N.B. caller will free nfs_server.hostname in all cases */
+	ctx->nfs_server.hostname = kstrcreate(dev_name, len, GFP_KERNEL);
+	if (!ctx->nfs_server.hostname)
+		goto out_nomem;
+	len = strlen(++end);
+	if (len > maxpathlen)
+		goto out_path;
+	ctx->nfs_server.export_path = kstrcreate(end, len, GFP_KERNEL);
+	if (!ctx->nfs_server.export_path)
+		goto out_nomem;
+
+	dfprintk(MOUNT, "NFS: MNTPATH: '%s'\n", ctx->nfs_server.export_path);
+	return 0;
+
+out_bad_devname:
+	ctx->mc.error = "NFS: device name not in host:path format";
+	return -EINVAL;
+out_nomem:
+	ctx->mc.error = "NFS: not enough memory to parse device name";
+	return -ENOMEM;
+out_hostname:
+	ctx->mc.error = "NFS: server hostname too long";
+	return -ENAMETOOLONG;
+out_path:
+	ctx->mc.error = "NFS: export pathname too long";
+	return -ENAMETOOLONG;
+}
+
+/*
+ * Parse monolithic NFS2/NFS3 mount data
+ * - fills in the mount root filehandle
+ *
+ * For option strings, user space handles the following behaviors:
+ *
+ * + DNS: mapping server host name to IP address ("addr=" option)
+ *
+ * + failure mode: how to behave if a mount request can't be handled
+ *   immediately ("fg/bg" option)
+ *
+ * + retry: how often to retry a mount request ("retry=" option)
+ *
+ * + breaking back: trying proto=udp after proto=tcp, v2 after v3,
+ *   mountproto=tcp after mountproto=udp, and so on
+ */
+static int nfs23_monolithic_mount_data(struct mount_context *mc,
+				       struct nfs_mount_data *data)
+{
+	struct nfs_mount_context *ctx =
+		container_of(mc, struct nfs_mount_context, mc);
+	struct nfs_fh *mntfh = ctx->mntfh;
+	struct sockaddr *sap = (struct sockaddr *)&ctx->nfs_server.address;
+	int extra_flags = NFS_MOUNT_LEGACY_INTERFACE;
+	int ret;
+
+	if (data == NULL)
+		goto out_no_data;
+
+	ctx->version = NFS_DEFAULT_VERSION;
+	switch (data->version) {
+	case 1:
+		data->namlen = 0;
+	case 2:
+		data->bsize = 0;
+	case 3:
+		if (data->flags & NFS_MOUNT_VER3)
+			goto out_no_v3;
+		data->root.size = NFS2_FHSIZE;
+		memcpy(data->root.data, data->old_root.data, NFS2_FHSIZE);
+		/* Turn off security negotiation */
+		extra_flags |= NFS_MOUNT_SECFLAVOUR;
+	case 4:
+		if (data->flags & NFS_MOUNT_SECFLAVOUR)
+			goto out_no_sec;
+	case 5:
+		memset(data->context, 0, sizeof(data->context));
+	case 6:
+		if (data->flags & NFS_MOUNT_VER3) {
+			if (data->root.size > NFS3_FHSIZE || data->root.size == 0)
+				goto out_invalid_fh;
+			mntfh->size = data->root.size;
+			ctx->version = 3;
+		} else {
+			mntfh->size = NFS2_FHSIZE;
+			ctx->version = 2;
+		}
+
+		memcpy(mntfh->data, data->root.data, mntfh->size);
+		if (mntfh->size < sizeof(mntfh->data))
+			memset(mntfh->data + mntfh->size, 0,
+			       sizeof(mntfh->data) - mntfh->size);
+
+		/*
+		 * Translate to nfs_mount_context, which nfs_fill_super
+		 * can deal with.
+		 */
+		ctx->flags	= data->flags & NFS_MOUNT_FLAGMASK;
+		ctx->flags	|= extra_flags;
+		ctx->rsize	= data->rsize;
+		ctx->wsize	= data->wsize;
+		ctx->timeo	= data->timeo;
+		ctx->retrans	= data->retrans;
+		ctx->acregmin	= data->acregmin;
+		ctx->acregmax	= data->acregmax;
+		ctx->acdirmin	= data->acdirmin;
+		ctx->acdirmax	= data->acdirmax;
+		ctx->need_mount	= false;
+
+		memcpy(sap, &data->addr, sizeof(data->addr));
+		ctx->nfs_server.addrlen = sizeof(data->addr);
+		ctx->nfs_server.port = ntohs(data->addr.sin_port);
+		if (!nfs_verify_server_address(sap))
+			goto out_no_address;
+
+		if (!(data->flags & NFS_MOUNT_TCP))
+			ctx->nfs_server.protocol = XPRT_TRANSPORT_UDP;
+		/* N.B. caller will free nfs_server.hostname in all cases */
+		ctx->nfs_server.hostname = kstrdup(data->hostname, GFP_KERNEL);
+		if (!ctx->nfs_server.hostname)
+			goto out_nomem;
+
+		ctx->namlen	= data->namlen;
+		ctx->bsize	= data->bsize;
+
+		if (data->flags & NFS_MOUNT_SECFLAVOUR)
+			ctx->selected_flavor = data->pseudoflavor;
+		else
+			ctx->selected_flavor = RPC_AUTH_UNIX;
+
+		if (!(data->flags & NFS_MOUNT_NONLM))
+			ctx->flags &= ~(NFS_MOUNT_LOCAL_FLOCK |
+					 NFS_MOUNT_LOCAL_FCNTL);
+		else
+			ctx->flags |= (NFS_MOUNT_LOCAL_FLOCK |
+					NFS_MOUNT_LOCAL_FCNTL);
+
+		/* The legacy version 6 binary mount data from userspace has a
+		 * field used only to transport selinux information into the
+		 * kernel.  To continue to support that functionality we
+		 * have a touch of selinux knowledge here in the NFS code. The
+		 * userspace code converted context=blah to just blah so we are
+		 * converting back to the full string selinux understands.
+		 */
+		if (data->context[0]) {
+#ifdef CONFIG_SECURITY_SELINUX
+			char *opts_str = kmalloc(sizeof(data->context) + 8, GFP_KERNEL);
+			if (!opts_str)
+				return -ENOMEM;
+			strcpy(opts_str, "context=");
+			data->context[NFS_MAX_CONTEXT_LEN] = '\0';
+			strcat(opts_str, &data->context[0]);
+			ret = vfs_mount_option(mc, opts_str);
+			kfree(opts_str);
+			if (ret)
+				return ret;
+#else
+			return -EINVAL;
+#endif
+		}
+
+		break;
+	default:
+		return generic_monolithic_mount_data(mc, data);
+	}
+
+	ctx->skip_remount_option_check = true;
+	return 0;
+
+out_no_data:
+	if (mc->ms_flags & MS_REMOUNT) {
+		ctx->skip_remount_option_check = true;
+		return 0;
+	}
+	mc->error = "NFS: mount program didn't pass any mount data";
+	return -EINVAL;
+
+out_no_v3:
+	mc->error = "NFS: nfs_mount_data version does not support v3";
+	return -EINVAL;
+
+out_no_sec:
+	mc->error = "NFS: nfs_mount_data version supports only AUTH_SYS";
+	return -EINVAL;
+
+out_nomem:
+	dfprintk(MOUNT, "NFS: not enough memory to handle mount options\n");
+	return -ENOMEM;
+
+out_no_address:
+	mc->error = "NFS: mount program didn't pass remote address";
+	return -EINVAL;
+
+out_invalid_fh:
+	mc->error = "NFS: invalid root filehandle";
+	return -EINVAL;
+}
+
+/*
+ * Validate NFSv4 mount options
+ */
+static int nfs4_monolithic_mount_data(struct mount_context *mc,
+				      struct nfs4_mount_data *data)
+{
+	struct nfs_mount_context *ctx =
+		container_of(mc, struct nfs_mount_context, mc);
+	struct sockaddr *sap = (struct sockaddr *)&ctx->nfs_server.address;
+	char *c;
+
+	if (data == NULL)
+		goto out_no_data;
+
+	ctx->version = 4;
+
+	switch (data->version) {
+	case 1:
+		if (data->host_addrlen > sizeof(ctx->nfs_server.address))
+			goto out_no_address;
+		if (data->host_addrlen == 0)
+			goto out_no_address;
+		ctx->nfs_server.addrlen = data->host_addrlen;
+		if (copy_from_user(sap, data->host_addr, data->host_addrlen))
+			return -EFAULT;
+		if (!nfs_verify_server_address(sap))
+			goto out_no_address;
+		ctx->nfs_server.port = ntohs(((struct sockaddr_in *)sap)->sin_port);
+
+		if (data->auth_flavourlen) {
+			rpc_authflavor_t pseudoflavor;
+			if (data->auth_flavourlen > 1)
+				goto out_inval_auth;
+			if (copy_from_user(&pseudoflavor,
+					   data->auth_flavours,
+					   sizeof(pseudoflavor)))
+				return -EFAULT;
+			ctx->selected_flavor = pseudoflavor;
+		} else
+			ctx->selected_flavor = RPC_AUTH_UNIX;
+
+		c = strndup_user(data->hostname.data, NFS4_MAXNAMLEN);
+		if (IS_ERR(c))
+			return PTR_ERR(c);
+		ctx->nfs_server.hostname = c;
+
+		c = strndup_user(data->mnt_path.data, NFS4_MAXPATHLEN);
+		if (IS_ERR(c))
+			return PTR_ERR(c);
+		ctx->nfs_server.export_path = c;
+		dfprintk(MOUNT, "NFS: MNTPATH: '%s'\n", c);
+
+		c = strndup_user(data->client_addr.data, 16);
+		if (IS_ERR(c))
+			return PTR_ERR(c);
+		ctx->client_address = c;
+
+		/*
+		 * Translate to nfs_mount_context, which nfs4_fill_super
+		 * can deal with.
+		 */
+
+		ctx->flags	= data->flags & NFS4_MOUNT_FLAGMASK;
+		ctx->rsize	= data->rsize;
+		ctx->wsize	= data->wsize;
+		ctx->timeo	= data->timeo;
+		ctx->retrans	= data->retrans;
+		ctx->acregmin	= data->acregmin;
+		ctx->acregmax	= data->acregmax;
+		ctx->acdirmin	= data->acdirmin;
+		ctx->acdirmax	= data->acdirmax;
+		ctx->nfs_server.protocol = data->proto;
+		nfs_validate_transport_protocol(ctx);
+		if (ctx->nfs_server.protocol == XPRT_TRANSPORT_UDP)
+			goto out_invalid_transport_udp;
+
+		break;
+	default:
+		return generic_monolithic_mount_data(mc, data);
+	}
+
+	ctx->skip_remount_option_check = true;
+	return 0;
+
+out_no_data:
+	if (mc->ms_flags & MS_REMOUNT) {
+		ctx->skip_remount_option_check = true;
+		return 0;
+	}
+	mc->error = "NFS4: mount program didn't pass any mount data";
+	return -EINVAL;
+
+out_inval_auth:
+	mc->error = "NFS4: Invalid number of RPC auth flavours";
+	return -EINVAL;
+
+out_no_address:
+	mc->error = "NFS4: mount program didn't pass remote address";
+	return -EINVAL;
+
+out_invalid_transport_udp:
+	mc->error = "NFSv4: Unsupported transport protocol udp";
+	return -EINVAL;
+}
+
+/*
+ * Parse a monolithic block of data from sys_mount().
+ */
+static int nfs_monolithic_mount_data(struct mount_context *mc, void *data)
+{
+	if (mc->fs_type == &nfs_fs_type)
+		return nfs23_monolithic_mount_data(mc, data);
+
+#if IS_ENABLED(CONFIG_NFS_V4)
+	if (mc->fs_type == &nfs4_fs_type)
+		return nfs4_monolithic_mount_data(mc, data);
+#endif
+
+	mc->error = "NFS: Unsupported monolithic data version";
+	return -EINVAL;
+}
+
+/*
+ * Parse a single mount option in "key[=val]" form.
+ */
+static int nfs_mount_ctx_option(struct mount_context *mc, char *p)
+{
+	struct nfs_mount_context *mnt =
+		container_of(mc, struct nfs_mount_context, mc);
+	substring_t args[MAX_OPT_ARGS];
+	char *string;
+	int rc = 0, token;
+
+	dfprintk(MOUNT, "NFS:   parsing nfs mount option '%s'\n", p);
+
+	token = match_token(p, nfs_mount_option_tokens, args);
+	switch (token) {
+		/*
+		 * boolean options:  foo/nofoo
+		 */
+	case Opt_soft:
+		mnt->flags |= NFS_MOUNT_SOFT;
+		break;
+	case Opt_hard:
+		mnt->flags &= ~NFS_MOUNT_SOFT;
+		break;
+	case Opt_posix:
+		mnt->flags |= NFS_MOUNT_POSIX;
+		break;
+	case Opt_noposix:
+		mnt->flags &= ~NFS_MOUNT_POSIX;
+		break;
+	case Opt_cto:
+		mnt->flags &= ~NFS_MOUNT_NOCTO;
+		break;
+	case Opt_nocto:
+		mnt->flags |= NFS_MOUNT_NOCTO;
+		break;
+	case Opt_ac:
+		mnt->flags &= ~NFS_MOUNT_NOAC;
+		break;
+	case Opt_noac:
+		mnt->flags |= NFS_MOUNT_NOAC;
+		break;
+	case Opt_lock:
+		mnt->flags &= ~NFS_MOUNT_NONLM;
+		mnt->flags &= ~(NFS_MOUNT_LOCAL_FLOCK | NFS_MOUNT_LOCAL_FCNTL);
+		break;
+	case Opt_nolock:
+		mnt->flags |= NFS_MOUNT_NONLM;
+		mnt->flags |= (NFS_MOUNT_LOCAL_FLOCK | NFS_MOUNT_LOCAL_FCNTL);
+		break;
+	case Opt_udp:
+		mnt->flags &= ~NFS_MOUNT_TCP;
+		mnt->nfs_server.protocol = XPRT_TRANSPORT_UDP;
+		break;
+	case Opt_tcp:
+		mnt->flags |= NFS_MOUNT_TCP;
+		mnt->nfs_server.protocol = XPRT_TRANSPORT_TCP;
+		break;
+	case Opt_rdma:
+		mnt->flags |= NFS_MOUNT_TCP; /* for side protocols */
+		mnt->nfs_server.protocol = XPRT_TRANSPORT_RDMA;
+		xprt_load_transport(p);
+		break;
+	case Opt_acl:
+		mnt->flags &= ~NFS_MOUNT_NOACL;
+		break;
+	case Opt_noacl:
+		mnt->flags |= NFS_MOUNT_NOACL;
+		break;
+	case Opt_rdirplus:
+		mnt->flags &= ~NFS_MOUNT_NORDIRPLUS;
+		break;
+	case Opt_nordirplus:
+		mnt->flags |= NFS_MOUNT_NORDIRPLUS;
+		break;
+	case Opt_sharecache:
+		mnt->flags &= ~NFS_MOUNT_UNSHARED;
+		break;
+	case Opt_nosharecache:
+		mnt->flags |= NFS_MOUNT_UNSHARED;
+		break;
+	case Opt_resvport:
+		mnt->flags &= ~NFS_MOUNT_NORESVPORT;
+		break;
+	case Opt_noresvport:
+		mnt->flags |= NFS_MOUNT_NORESVPORT;
+		break;
+	case Opt_fscache:
+		mnt->options |= NFS_OPTION_FSCACHE;
+		kfree(mnt->fscache_uniq);
+		mnt->fscache_uniq = NULL;
+		break;
+	case Opt_nofscache:
+		mnt->options &= ~NFS_OPTION_FSCACHE;
+		kfree(mnt->fscache_uniq);
+		mnt->fscache_uniq = NULL;
+		break;
+	case Opt_migration:
+		mnt->options |= NFS_OPTION_MIGRATION;
+		break;
+	case Opt_nomigration:
+		mnt->options &= ~NFS_OPTION_MIGRATION;
+		break;
+
+		/*
+		 * options that take numeric values
+		 */
+	case Opt_port:
+		if (nfs_get_option_ui_bound(mnt, args, &mnt->nfs_server.port,
+					    0, USHRT_MAX))
+			goto out_invalid_value;
+		break;
+	case Opt_rsize:
+		if (nfs_get_option_ui(mnt, args, &mnt->rsize))
+			goto out_invalid_value;
+		break;
+	case Opt_wsize:
+		if (nfs_get_option_ui(mnt, args, &mnt->wsize))
+			goto out_invalid_value;
+		break;
+	case Opt_bsize:
+		if (nfs_get_option_ui(mnt, args, &mnt->bsize))
+			goto out_invalid_value;
+		break;
+	case Opt_timeo:
+		if (nfs_get_option_ui_bound(mnt, args, &mnt->timeo, 1, INT_MAX))
+			goto out_invalid_value;
+		break;
+	case Opt_retrans:
+		if (nfs_get_option_ui_bound(mnt, args, &mnt->retrans, 0, INT_MAX))
+			goto out_invalid_value;
+		break;
+	case Opt_acregmin:
+		if (nfs_get_option_ui(mnt, args, &mnt->acregmin))
+			goto out_invalid_value;
+		break;
+	case Opt_acregmax:
+		if (nfs_get_option_ui(mnt, args, &mnt->acregmax))
+			goto out_invalid_value;
+		break;
+	case Opt_acdirmin:
+		if (nfs_get_option_ui(mnt, args, &mnt->acdirmin))
+			goto out_invalid_value;
+		break;
+	case Opt_acdirmax:
+		if (nfs_get_option_ui(mnt, args, &mnt->acdirmax))
+			goto out_invalid_value;
+		break;
+	case Opt_actimeo:
+		if (nfs_get_option_ui(mnt, args, &mnt->acdirmax))
+			goto out_invalid_value;
+		mnt->acregmin = mnt->acregmax =
+			mnt->acdirmin = mnt->acdirmax;
+		break;
+	case Opt_namelen:
+		if (nfs_get_option_ui(mnt, args, &mnt->namlen))
+			goto out_invalid_value;
+		break;
+	case Opt_mountport:
+		if (nfs_get_option_ui_bound(mnt, args, &mnt->mount_server.port,
+					    0, USHRT_MAX))
+			goto out_invalid_value;
+		break;
+	case Opt_mountvers:
+		if (nfs_get_option_ui_bound(mnt, args, &mnt->mount_server.version,
+					    NFS_MNT_VERSION, NFS_MNT3_VERSION))
+			goto out_invalid_value;
+		break;
+	case Opt_minorversion:
+		if (nfs_get_option_ui_bound(mnt, args, &mnt->minorversion,
+					    0, NFS4_MAX_MINOR_VERSION))
+			goto out_invalid_value;
+		break;
+
+		/*
+		 * options that take text values
+		 */
+	case Opt_nfsvers:
+		string = match_strdup(args);
+		if (string == NULL)
+			goto out_nomem;
+		rc = nfs_parse_version_string(string, mnt, args);
+		kfree(string);
+		if (!rc)
+			goto out_invalid_value;
+		break;
+	case Opt_sec:
+		string = match_strdup(args);
+		if (string == NULL)
+			goto out_nomem;
+		rc = nfs_parse_security_flavors(string, mnt);
+		kfree(string);
+		if (!rc) {
+			dfprintk(MOUNT, "NFS:   unrecognized security flavor\n");
+			return -EINVAL;
+		}
+		break;
+	case Opt_proto:
+		string = match_strdup(args);
+		if (string == NULL)
+			goto out_nomem;
+		token = match_token(string, nfs_xprt_protocol_tokens, args);
+
+		mnt->protofamily = AF_INET;
+		switch (token) {
+		case Opt_xprt_udp6:
+			mnt->protofamily = AF_INET6;
+		case Opt_xprt_udp:
+			mnt->flags &= ~NFS_MOUNT_TCP;
+			mnt->nfs_server.protocol = XPRT_TRANSPORT_UDP;
+			break;
+		case Opt_xprt_tcp6:
+			mnt->protofamily = AF_INET6;
+		case Opt_xprt_tcp:
+			mnt->flags |= NFS_MOUNT_TCP;
+			mnt->nfs_server.protocol = XPRT_TRANSPORT_TCP;
+			break;
+		case Opt_xprt_rdma6:
+			mnt->protofamily = AF_INET6;
+		case Opt_xprt_rdma:
+			/* vector side protocols to TCP */
+			mnt->flags |= NFS_MOUNT_TCP;
+			mnt->nfs_server.protocol = XPRT_TRANSPORT_RDMA;
+			xprt_load_transport(string);
+			break;
+		default:
+			dfprintk(MOUNT, "NFS:   unrecognized transport protocol\n");
+			kfree(string);
+			return -EINVAL;
+		}
+		kfree(string);
+		break;
+	case Opt_mountproto:
+		string = match_strdup(args);
+		if (string == NULL)
+			goto out_nomem;
+		token = match_token(string, nfs_xprt_protocol_tokens, args);
+		kfree(string);
+
+		mnt->mountfamily = AF_INET;
+		switch (token) {
+		case Opt_xprt_udp6:
+			mnt->mountfamily = AF_INET6;
+		case Opt_xprt_udp:
+			mnt->mount_server.protocol = XPRT_TRANSPORT_UDP;
+			break;
+		case Opt_xprt_tcp6:
+			mnt->mountfamily = AF_INET6;
+		case Opt_xprt_tcp:
+			mnt->mount_server.protocol = XPRT_TRANSPORT_TCP;
+			break;
+		case Opt_xprt_rdma: /* not used for side protocols */
+		default:
+			mc->error = "NFS: Unrecognized transport protocol";
+			return -EINVAL;
+		}
+		break;
+	case Opt_addr:
+		string = match_strdup(args);
+		if (string == NULL)
+			goto out_nomem;
+		mnt->nfs_server.addrlen =
+			rpc_pton(mc->net_ns, string, strlen(string),
+				 (struct sockaddr *)&mnt->nfs_server.address,
+				 sizeof(mnt->nfs_server._address));
+		kfree(string);
+		if (mnt->nfs_server.addrlen == 0)
+			goto out_invalid_address;
+		break;
+	case Opt_clientaddr:
+		if (nfs_get_option_str(args, &mnt->client_address))
+			goto out_nomem;
+		break;
+	case Opt_mounthost:
+		if (nfs_get_option_str(args, &mnt->mount_server.hostname))
+			goto out_nomem;
+		break;
+	case Opt_mountaddr:
+		string = match_strdup(args);
+		if (string == NULL)
+			goto out_nomem;
+		mnt->mount_server.addrlen =
+			rpc_pton(mc->net_ns, string, strlen(string),
+				 (struct sockaddr *)&mnt->mount_server.address,
+				 sizeof(mnt->mount_server._address));
+		kfree(string);
+		if (mnt->mount_server.addrlen == 0)
+			goto out_invalid_address;
+		break;
+	case Opt_lookupcache:
+		string = match_strdup(args);
+		if (string == NULL)
+			goto out_nomem;
+		token = match_token(string, nfs_lookupcache_tokens, args);
+		kfree(string);
+		switch (token) {
+		case Opt_lookupcache_all:
+			mnt->flags &= ~(NFS_MOUNT_LOOKUP_CACHE_NONEG |
+					NFS_MOUNT_LOOKUP_CACHE_NONE);
+			break;
+		case Opt_lookupcache_positive:
+			mnt->flags &= ~NFS_MOUNT_LOOKUP_CACHE_NONE;
+			mnt->flags |= NFS_MOUNT_LOOKUP_CACHE_NONEG;
+			break;
+		case Opt_lookupcache_none:
+			mnt->flags |= (NFS_MOUNT_LOOKUP_CACHE_NONEG |
+				       NFS_MOUNT_LOOKUP_CACHE_NONE);
+			break;
+		default:
+			mc->error = "NFS: Invalid lookupcache argument";
+			return -EINVAL;
+		}
+		break;
+	case Opt_fscache_uniq:
+		if (nfs_get_option_str(args, &mnt->fscache_uniq))
+			goto out_nomem;
+		mnt->options |= NFS_OPTION_FSCACHE;
+		break;
+	case Opt_local_lock:
+		string = match_strdup(args);
+		if (string == NULL)
+			goto out_nomem;
+		token = match_token(string, nfs_local_lock_tokens, args);
+		kfree(string);
+		switch (token) {
+		case Opt_local_lock_all:
+			mnt->flags |= (NFS_MOUNT_LOCAL_FLOCK |
+				       NFS_MOUNT_LOCAL_FCNTL);
+			break;
+		case Opt_local_lock_flock:
+			mnt->flags |= NFS_MOUNT_LOCAL_FLOCK;
+			break;
+		case Opt_local_lock_posix:
+			mnt->flags |= NFS_MOUNT_LOCAL_FCNTL;
+			break;
+		case Opt_local_lock_none:
+			mnt->flags &= ~(NFS_MOUNT_LOCAL_FLOCK |
+					NFS_MOUNT_LOCAL_FCNTL);
+			break;
+		default:
+			mc->error = "NFS: invalid local_lock argument";
+			return -EINVAL;
+		}
+		break;
+
+		/*
+		 * Special options
+		 */
+	case Opt_sloppy:
+		mc->sloppy = 1;
+		dfprintk(MOUNT, "NFS:   relaxing parsing rules\n");
+		break;
+	case Opt_userspace:
+	case Opt_deprecated:
+		dfprintk(MOUNT, "NFS:   ignoring mount option '%s'\n", p);
+		break;
+
+	default:
+		dfprintk(MOUNT, "NFS:   unrecognized mount option '%s'\n", p);
+		if (!mc->sloppy) {
+			mc->error = "NFS: Unrecognized mount option";
+			return -EINVAL;
+		}
+		break;
+	}
+
+	return 0;
+
+out_nomem:
+	printk(KERN_INFO "NFS: not enough memory to parse option\n");
+	return -ENOMEM;
+out_invalid_value:
+	mc->error = "NFS: Bad mount option value specified";
+	return -EINVAL;
+out_invalid_address:
+	mc->error = "NFS: Bad IP address specified";
+	return -EINVAL;
+}
+
+/*
+ * Validate the preparsed information in the mount context.
+ */
+static int nfs_mount_ctx_validate(struct mount_context *mc)
+{
+	struct nfs_mount_context *ctx =
+		container_of(mc, struct nfs_mount_context, mc);
+	struct nfs_subversion *nfs_mod;
+	struct sockaddr *sap = (struct sockaddr *)&ctx->nfs_server.address;
+	int max_namelen = PAGE_SIZE;
+	int max_pathlen = NFS_MAXPATHLEN;
+	int port = 0;
+	int ret;
+
+	if (ctx->mc.mount_type & MOUNT_TYPE_REMOUNT)
+		return 0;
+
+	if (!ctx->mc.device)
+		goto out_no_device_name;
+
+	/* Check for sanity first. */
+	if (ctx->minorversion && ctx->version != 4)
+		goto out_minorversion_mismatch;
+
+	if (ctx->options & NFS_OPTION_MIGRATION &&
+	    (ctx->version != 4 || ctx->minorversion != 0))
+		goto out_migration_misuse;
+
+	/* Verify that any proto=/mountproto= options match the address
+	 * families in the addr=/mountaddr= options.
+	 */
+	if (ctx->protofamily != AF_UNSPEC &&
+	    ctx->protofamily != ctx->nfs_server.address.sa_family)
+		goto out_proto_mismatch;
+
+	if (ctx->mountfamily != AF_UNSPEC) {
+		if (ctx->mount_server.addrlen) {
+			if (ctx->mountfamily != ctx->mount_server.address.sa_family)
+				goto out_mountproto_mismatch;
+		} else {
+			if (ctx->mountfamily != ctx->nfs_server.address.sa_family)
+				goto out_mountproto_mismatch;
+		}
+	}
+
+	if (!nfs_verify_server_address(sap))
+		goto out_no_address;
+
+	if (ctx->version == 4) {
+		if (IS_ENABLED(CONFIG_NFS_V4)) {
+			port = NFS_PORT;
+			max_namelen = NFS4_MAXNAMLEN;
+			max_pathlen = NFS4_MAXPATHLEN;
+			nfs_validate_transport_protocol(ctx);
+			if (ctx->nfs_server.protocol == XPRT_TRANSPORT_UDP)
+				goto out_invalid_transport_udp;
+			ctx->flags &= ~(NFS_MOUNT_NONLM | NFS_MOUNT_NOACL |
+					 NFS_MOUNT_VER3 | NFS_MOUNT_LOCAL_FLOCK |
+					 NFS_MOUNT_LOCAL_FCNTL);
+		} else {
+			goto out_v4_not_compiled;
+		}
+	} else {
+		nfs_set_mount_transport_protocol(ctx);
+	}
+
+	nfs_set_port(sap, &ctx->nfs_server.port, port);
+
+	ret = nfs_parse_devname(ctx, max_namelen, max_pathlen);
+	if (ret < 0)
+		return ret;
+
+	/* Load the NFS protocol module if we haven't done so yet */
+	if (!ctx->nfs_mod) {
+		nfs_mod = get_nfs_version(ctx->version);
+		if (IS_ERR(nfs_mod)) {
+			ret = PTR_ERR(nfs_mod);
+			goto out_version_unavailable;
+		}
+		ctx->nfs_mod = nfs_mod;
+	}
+	return 0;
+
+out_no_device_name:
+	mc->error = "NFS: Device name not specified";
+	return -EINVAL;
+out_v4_not_compiled:
+	mc->error = "NFS: NFSv4 is not compiled into kernel";
+	return -EPROTONOSUPPORT;
+out_invalid_transport_udp:
+	mc->error = "NFSv4: Unsupported transport protocol udp";
+	return -EINVAL;
+out_no_address:
+	mc->error = "NFS: mount program didn't pass remote address";
+	return -EINVAL;
+out_mountproto_mismatch:
+	mc->error = "NFS: Mount server address does not match mountproto= option";
+	return -EINVAL;
+out_proto_mismatch:
+	mc->error = "NFS: Server address does not match proto= option";
+	return -EINVAL;
+out_minorversion_mismatch:
+	mc->error = "NFS: Mount option does not support minorversion";
+	return -EINVAL;
+out_migration_misuse:
+	mc->error = "NFS: 'Migration' not supported for this NFS version";
+	return -EINVAL;
+out_version_unavailable:
+	mc->error = "NFS: Version unavailable";
+	return ret;
+}
+
+/*
+ * Use the preparsed information in the mount context to effect a mount.
+ */
+static struct dentry *nfs_ordinary_mount(struct nfs_mount_context *ctx)
+{
+	ctx->set_security = nfs_set_sb_security;
+
+	return ctx->nfs_mod->rpc_ops->try_mount(ctx);
+}
+
+/*
+ * Clone an NFS2/3/4 server record on xdev traversal (FSID-change)
+ */
+static struct dentry *nfs_xdev_mount(struct nfs_mount_context *ctx)
+{
+	struct nfs_server *server;
+	struct dentry *mntroot = ERR_PTR(-ENOMEM);
+
+	dprintk("--> nfs_xdev_mount()\n");
+
+	ctx->set_security = nfs_clone_sb_security;
+
+	/* create a new volume representation */
+	server = ctx->nfs_mod->rpc_ops->clone_server(NFS_SB(ctx->clone_data.sb),
+						     ctx->mntfh,
+						     ctx->clone_data.fattr,
+						     ctx->selected_flavor);
+
+	if (IS_ERR(server))
+		mntroot = ERR_CAST(server);
+	else
+		mntroot = nfs_fs_mount_common(server, ctx);
+
+	dprintk("<-- nfs_xdev_mount() = %ld\n",
+			IS_ERR(mntroot) ? PTR_ERR(mntroot) : 0L);
+	return mntroot;
+}
+
+/*
+ * Handle ordinary mounts inspired by the user and cross-FSID mounts.
+ */
+struct dentry *nfs_general_mount(struct nfs_mount_context *ctx)
+{
+	switch (ctx->mount_type) {
+	case NFS_MOUNT_ORDINARY:
+		return nfs_ordinary_mount(ctx);
+
+	case NFS_MOUNT_CROSS_DEV:
+		return nfs_xdev_mount(ctx);
+
+	default:
+		ctx->mc.error = "NFS: Unknown mount type";
+		return ERR_PTR(-ENOTSUPP);
+	}
+}
+EXPORT_SYMBOL_GPL(nfs_general_mount);
+
+static struct dentry *nfs_fs_mount(struct mount_context *mc)
+{
+	struct nfs_mount_context *ctx =
+		container_of(mc, struct nfs_mount_context, mc);
+
+	if (!ctx->nfs_mod) {
+		pr_warn("Missing nfs_mod\n");
+		return ERR_PTR(-EINVAL);
+	}
+	if (!ctx->nfs_mod->rpc_ops) {
+		pr_warn("Missing rpc_ops\n");
+		return ERR_PTR(-EINVAL);
+	}
+	if (!ctx->nfs_mod->rpc_ops->mount) {
+		pr_warn("Missing mount\n");
+		return ERR_PTR(-EINVAL);
+	}
+
+	return ctx->nfs_mod->rpc_ops->mount(ctx);
+}
+
+/*
+ * Handle duplication of a mount context.  The caller copied *src into *mc, but
+ * it can't deal with resource pointers in the filesystem context, so we have
+ * to do that.  We need to clear pointers, copy data or get extra refs as
+ * appropriate.
+ */
+static int nfs_mount_ctx_dup(struct mount_context *mc, struct mount_context *src)
+{
+	struct nfs_mount_context *ctx =
+		container_of(mc, struct nfs_mount_context, mc);
+
+	__module_get(ctx->nfs_mod->owner);
+	ctx->client_address		= NULL;
+	ctx->mount_server.hostname	= NULL;
+	ctx->nfs_server.export_path	= NULL;
+	ctx->nfs_server.hostname	= NULL;
+	ctx->fscache_uniq		= NULL;
+
+	ctx->mntfh = nfs_alloc_fhandle();
+	if (!ctx->mntfh)
+		return -ENOMEM;
+	return 0;
+}
+
+static void nfs_mount_ctx_free(struct mount_context *mc)
+{
+	struct nfs_mount_context *ctx =
+		container_of(mc, struct nfs_mount_context, mc);
+
+	if (ctx->nfs_mod)
+		put_nfs_version(ctx->nfs_mod);
+	kfree(ctx->client_address);
+	kfree(ctx->mount_server.hostname);
+	if (ctx->nfs_server.export_path != nfs_slash)
+		kfree(ctx->nfs_server.export_path);
+	kfree(ctx->nfs_server.hostname);
+	kfree(ctx->fscache_uniq);
+	nfs_free_fhandle(ctx->mntfh);
+}
+
+static const struct mount_context_operations nfs_mount_ctx_ops = {
+	.free			= nfs_mount_ctx_free,
+	.dup			= nfs_mount_ctx_dup,
+	.option			= nfs_mount_ctx_option,
+	.monolithic_mount_data	= nfs_monolithic_mount_data,
+	.validate		= nfs_mount_ctx_validate,
+	.mount			= nfs_fs_mount,
+	.fill_super		= nfs_fill_super,
+};
+
+/*
+ * Initialise a mount context from an extant superblock for remounting.
+ */
+static int nfs_mount_init_from_sb(struct mount_context *mc,
+				  struct super_block *sb)
+{
+	struct nfs_mount_context *ctx =
+		container_of(mc, struct nfs_mount_context, mc);
+	struct nfs_server *nfss = sb->s_fs_info;
+	struct net *net = nfss->nfs_client->cl_net;
+
+	ctx->flags		= nfss->flags;
+	ctx->rsize		= nfss->rsize;
+	ctx->wsize		= nfss->wsize;
+	ctx->retrans		= nfss->client->cl_timeout->to_retries;
+	ctx->selected_flavor	= nfss->client->cl_auth->au_flavor;
+	ctx->acregmin		= nfss->acregmin / HZ;
+	ctx->acregmax		= nfss->acregmax / HZ;
+	ctx->acdirmin		= nfss->acdirmin / HZ;
+	ctx->acdirmax		= nfss->acdirmax / HZ;
+	ctx->timeo		= 10U * nfss->client->cl_timeout->to_initval / HZ;
+	ctx->nfs_server.port	= nfss->port;
+	ctx->nfs_server.addrlen	= nfss->nfs_client->cl_addrlen;
+	ctx->version		= nfss->nfs_client->rpc_ops->version;
+	ctx->minorversion	= nfss->nfs_client->cl_minorversion;
+
+	memcpy(&ctx->nfs_server.address, &nfss->nfs_client->cl_addr,
+		ctx->nfs_server.addrlen);
+
+	if (ctx->mc.net_ns != net) {
+		put_net(ctx->mc.net_ns);
+		ctx->mc.net_ns = get_net(net);
+	}
+
+	ctx->nfs_mod = nfss->nfs_client->cl_nfs_mod;
+	if (!try_module_get(ctx->nfs_mod->owner)) {
+		ctx->nfs_mod = NULL;
+		mc->error = "NFS: Protocol module not available";
+		return -ENOENT;
+	}
+
+	return 0;
+}
+
+/*
+ * Prepare mount context.  We use the namespaces attached to the context.  This
+ * may be the current process's namespaces, or it may be a container's
+ * namespaces.
+ */
+static int nfs_fs_fsopen(struct mount_context *mc, struct super_block *src_sb)
+{
+	struct nfs_mount_context *ctx =
+		container_of(mc, struct nfs_mount_context, mc);
+
+	ctx->mntfh = nfs_alloc_fhandle();
+	if (!ctx->mntfh)
+		return -ENOMEM;
+
+	ctx->mc.ops		= &nfs_mount_ctx_ops;
+	ctx->mount_type		= NFS_MOUNT_ORDINARY;
+	ctx->protofamily	= AF_UNSPEC;
+	ctx->mountfamily	= AF_UNSPEC;
+	ctx->mount_server.port	= NFS_UNSPEC_PORT;
+
+	if (!src_sb) {
+		ctx->timeo		= NFS_UNSPEC_TIMEO;
+		ctx->retrans		= NFS_UNSPEC_RETRANS;
+		ctx->acregmin		= NFS_DEF_ACREGMIN;
+		ctx->acregmax		= NFS_DEF_ACREGMAX;
+		ctx->acdirmin		= NFS_DEF_ACDIRMIN;
+		ctx->acdirmax		= NFS_DEF_ACDIRMAX;
+		ctx->nfs_server.port	= NFS_UNSPEC_PORT;
+		ctx->nfs_server.protocol = XPRT_TRANSPORT_TCP;
+		ctx->selected_flavor	= RPC_AUTH_MAXFLAVOR;
+		ctx->minorversion	= 0;
+		ctx->need_mount		= true;
+		return 0;
+	}
+
+	return nfs_mount_init_from_sb(mc, src_sb);
+}
+
+struct file_system_type nfs_fs_type = {
+	.owner		= THIS_MODULE,
+	.name		= "nfs",
+	.fsopen		= nfs_fs_fsopen,
+	.mc_size	= sizeof(struct nfs_mount_context),
+	.kill_sb	= nfs_kill_super,
+	.fs_flags	= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
+};
+MODULE_ALIAS_FS("nfs");
+EXPORT_SYMBOL_GPL(nfs_fs_type);
+
+#if IS_ENABLED(CONFIG_NFS_V4)
+struct file_system_type nfs4_fs_type = {
+	.owner		= THIS_MODULE,
+	.name		= "nfs4",
+	.fsopen		= nfs_fs_fsopen,
+	.mc_size	= sizeof(struct nfs_mount_context),
+	.kill_sb	= nfs_kill_super,
+	.fs_flags	= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
+};
+MODULE_ALIAS_FS("nfs4");
+MODULE_ALIAS("nfs4");
+EXPORT_SYMBOL_GPL(nfs4_fs_type);
+#endif /* CONFIG_NFS_V4 */
diff --git a/fs/nfs/namespace.c b/fs/nfs/namespace.c
index 786f17580582..9be008f7072e 100644
--- a/fs/nfs/namespace.c
+++ b/fs/nfs/namespace.c
@@ -18,6 +18,7 @@
 #include <linux/vfs.h>
 #include <linux/sunrpc/gss_api.h>
 #include "internal.h"
+#include "nfs.h"
 
 #define NFSDBG_FACILITY		NFSDBG_VFS
 
@@ -224,10 +225,9 @@ void nfs_release_automount_timer(void)
  * Clone a mountpoint of the appropriate type
  */
 static struct vfsmount *nfs_do_clone_mount(struct nfs_server *server,
-					   const char *devname,
-					   struct nfs_clone_mount *mountdata)
+					   struct nfs_mount_context *ctx)
 {
-	return vfs_submount(mountdata->dentry, &nfs_xdev_fs_type, devname, mountdata);
+	return vfs_submount_mc(ctx->clone_data.dentry, &ctx->mc);
 }
 
 /**
@@ -241,33 +241,56 @@ static struct vfsmount *nfs_do_clone_mount(struct nfs_server *server,
 struct vfsmount *nfs_do_submount(struct dentry *dentry, struct nfs_fh *fh,
 				 struct nfs_fattr *fattr, rpc_authflavor_t authflavor)
 {
-	struct nfs_clone_mount mountdata = {
-		.sb = dentry->d_sb,
-		.dentry = dentry,
-		.fh = fh,
-		.fattr = fattr,
-		.authflavor = authflavor,
-	};
-	struct vfsmount *mnt = ERR_PTR(-ENOMEM);
-	char *page = (char *) __get_free_page(GFP_USER);
-	char *devname;
+	struct nfs_mount_context *ctx;
+	struct mount_context *mc;
+	struct vfsmount *mnt;
+	char *buffer, *p;
+
+	/* Open a new mount context, transferring parameters from the parent
+	 * superblock, including the network namespace.
+	 */
+	mc = __vfs_fsopen(&nfs_fs_type, dentry->d_sb, 0, 0, MOUNT_TYPE_SUBMOUNT);
+	if (IS_ERR(mc))
+		return ERR_CAST(mc);
+	ctx = container_of(mc, struct nfs_mount_context, mc);
+
+	mnt = ERR_PTR(-ENOMEM);
+	buffer = kmalloc(4096, GFP_USER);
+	if (!buffer)
+		goto err_mc;
+
+	ctx->mount_type		= NFS_MOUNT_CROSS_DEV;
+	ctx->selected_flavor	= authflavor;
+	ctx->clone_data.sb	= dentry->d_sb;
+	ctx->clone_data.dentry	= dentry;
+	ctx->clone_data.fattr	= fattr;
+	ctx->clone_data.cloned	= true;
+
+	nfs_copy_fh(ctx->mntfh, fh);
 
 	dprintk("--> nfs_do_submount()\n");
+	dprintk("%s: submounting on %pd2\n", __func__, dentry);
 
-	dprintk("%s: submounting on %pd2\n", __func__,
-			dentry);
-	if (page == NULL)
-		goto out;
-	devname = nfs_devname(dentry, page, PAGE_SIZE);
-	mnt = (struct vfsmount *)devname;
-	if (IS_ERR(devname))
-		goto free_page;
-	mnt = nfs_do_clone_mount(NFS_SB(dentry->d_sb), devname, &mountdata);
-free_page:
-	free_page((unsigned long)page);
-out:
-	dprintk("%s: done\n", __func__);
+	p = nfs_devname(dentry, buffer, 4096);
+	if (IS_ERR(p)) {
+		mc->error = "NFS: Couldn't determine submount pathname";
+		mnt = ERR_CAST(p);
+		goto err_buffer;
+	}
 
+	ctx->mc.device = kmemdup(p, buffer + 4096 - p, GFP_KERNEL);
+	if (!ctx->mc.device)
+		goto err_buffer;
+	kfree(buffer);
+
+	mnt = nfs_do_clone_mount(NFS_SB(dentry->d_sb), ctx);
+	goto err_mc;
+
+err_buffer:
+	kfree(buffer);
+err_mc:
+	put_mount_context(mc);
+	dprintk("%s: done\n", __func__);
 	dprintk("<-- nfs_do_submount() = %p\n", mnt);
 	return mnt;
 }
diff --git a/fs/nfs/nfs3_fs.h b/fs/nfs/nfs3_fs.h
index e134d6548ab7..5f9902dfb54e 100644
--- a/fs/nfs/nfs3_fs.h
+++ b/fs/nfs/nfs3_fs.h
@@ -26,7 +26,7 @@ static inline int nfs3_proc_setacls(struct inode *inode, struct posix_acl *acl,
 #endif /* CONFIG_NFS_V3_ACL */
 
 /* nfs3client.c */
-struct nfs_server *nfs3_create_server(struct nfs_mount_info *, struct nfs_subversion *);
+struct nfs_server *nfs3_create_server(struct nfs_mount_context *);
 struct nfs_server *nfs3_clone_server(struct nfs_server *, struct nfs_fh *,
 				     struct nfs_fattr *, rpc_authflavor_t);
 
diff --git a/fs/nfs/nfs3client.c b/fs/nfs/nfs3client.c
index 7879f2a0fcfd..ce54664e12b5 100644
--- a/fs/nfs/nfs3client.c
+++ b/fs/nfs/nfs3client.c
@@ -45,10 +45,10 @@ static inline void nfs_init_server_aclclient(struct nfs_server *server)
 }
 #endif
 
-struct nfs_server *nfs3_create_server(struct nfs_mount_info *mount_info,
-				      struct nfs_subversion *nfs_mod)
+struct nfs_server *nfs3_create_server(struct nfs_mount_context *ctx)
 {
-	struct nfs_server *server = nfs_create_server(mount_info, nfs_mod);
+	struct nfs_server *server = nfs_create_server(ctx);
+
 	/* Create a client RPC handle for the NFS v3 ACL management interface */
 	if (!IS_ERR(server))
 		nfs_init_server_aclclient(server);
diff --git a/fs/nfs/nfs3proc.c b/fs/nfs/nfs3proc.c
index dc925b531f32..1e0349fc33ec 100644
--- a/fs/nfs/nfs3proc.c
+++ b/fs/nfs/nfs3proc.c
@@ -922,6 +922,7 @@ const struct nfs_rpc_ops nfs_v3_clientops = {
 	.file_inode_ops	= &nfs3_file_inode_operations,
 	.file_ops	= &nfs_file_operations,
 	.getroot	= nfs3_proc_get_root,
+	.mount		= nfs_general_mount,
 	.submount	= nfs_submount,
 	.try_mount	= nfs_try_mount,
 	.getattr	= nfs3_proc_getattr,
diff --git a/fs/nfs/nfs4_fs.h b/fs/nfs/nfs4_fs.h
index af285cc27ccf..e85429690089 100644
--- a/fs/nfs/nfs4_fs.h
+++ b/fs/nfs/nfs4_fs.h
@@ -467,7 +467,6 @@ extern const nfs4_stateid zero_stateid;
 /* nfs4super.c */
 struct nfs_mount_info;
 extern struct nfs_subversion nfs_v4;
-struct dentry *nfs4_try_mount(int, const char *, struct nfs_mount_info *, struct nfs_subversion *);
 extern bool nfs4_disable_idmapping;
 extern unsigned short max_session_slots;
 extern unsigned short max_session_cb_slots;
@@ -477,6 +476,9 @@ extern bool recover_lost_locks;
 #define NFS4_CLIENT_ID_UNIQ_LEN		(64)
 extern char nfs4_client_id_uniquifier[NFS4_CLIENT_ID_UNIQ_LEN];
 
+extern struct dentry *nfs4_try_mount(struct nfs_mount_context *);
+extern struct dentry *nfs4_mount(struct nfs_mount_context *);
+
 /* nfs4sysctl.c */
 #ifdef CONFIG_SYSCTL
 int nfs4_register_sysctl(void);
diff --git a/fs/nfs/nfs4client.c b/fs/nfs/nfs4client.c
index 8346ccbf2d52..041d79b4c962 100644
--- a/fs/nfs/nfs4client.c
+++ b/fs/nfs/nfs4client.c
@@ -1093,56 +1093,56 @@ static int nfs4_server_common_setup(struct nfs_server *server,
  * Create a version 4 volume record
  */
 static int nfs4_init_server(struct nfs_server *server,
-		struct nfs_parsed_mount_data *data)
+			    struct nfs_mount_context *ctx)
 {
 	struct rpc_timeout timeparms;
 	int error;
 
 	dprintk("--> nfs4_init_server()\n");
 
-	nfs_init_timeout_values(&timeparms, data->nfs_server.protocol,
-			data->timeo, data->retrans);
+	nfs_init_timeout_values(&timeparms, ctx->nfs_server.protocol,
+			ctx->timeo, ctx->retrans);
 
 	/* Initialise the client representation from the mount data */
-	server->flags = data->flags;
-	server->options = data->options;
-	server->auth_info = data->auth_info;
+	server->flags = ctx->flags;
+	server->options = ctx->options;
+	server->auth_info = ctx->auth_info;
 
 	/* Use the first specified auth flavor. If this flavor isn't
 	 * allowed by the server, use the SECINFO path to try the
 	 * other specified flavors */
-	if (data->auth_info.flavor_len >= 1)
-		data->selected_flavor = data->auth_info.flavors[0];
+	if (ctx->auth_info.flavor_len >= 1)
+		ctx->selected_flavor = ctx->auth_info.flavors[0];
 	else
-		data->selected_flavor = RPC_AUTH_UNIX;
+		ctx->selected_flavor = RPC_AUTH_UNIX;
 
 	/* Get a client record */
 	error = nfs4_set_client(server,
-			data->nfs_server.hostname,
-			(const struct sockaddr *)&data->nfs_server.address,
-			data->nfs_server.addrlen,
-			data->client_address,
-			data->nfs_server.protocol,
+			ctx->nfs_server.hostname,
+			(const struct sockaddr *)&ctx->nfs_server.address,
+			ctx->nfs_server.addrlen,
+			ctx->client_address,
+			ctx->nfs_server.protocol,
 			&timeparms,
-			data->minorversion,
-			data->net);
+			ctx->minorversion,
+			ctx->mc.net_ns);
 	if (error < 0)
 		goto error;
 
-	if (data->rsize)
-		server->rsize = nfs_block_size(data->rsize, NULL);
-	if (data->wsize)
-		server->wsize = nfs_block_size(data->wsize, NULL);
+	if (ctx->rsize)
+		server->rsize = nfs_block_size(ctx->rsize, NULL);
+	if (ctx->wsize)
+		server->wsize = nfs_block_size(ctx->wsize, NULL);
 
-	server->acregmin = data->acregmin * HZ;
-	server->acregmax = data->acregmax * HZ;
-	server->acdirmin = data->acdirmin * HZ;
-	server->acdirmax = data->acdirmax * HZ;
+	server->acregmin = ctx->acregmin * HZ;
+	server->acregmax = ctx->acregmax * HZ;
+	server->acdirmin = ctx->acdirmin * HZ;
+	server->acdirmax = ctx->acdirmax * HZ;
 
-	server->port = data->nfs_server.port;
+	server->port = ctx->nfs_server.port;
 
 	error = nfs_init_server_rpcclient(server, &timeparms,
-					  data->selected_flavor);
+					  ctx->selected_flavor);
 
 error:
 	/* Done */
@@ -1154,10 +1154,7 @@ static int nfs4_init_server(struct nfs_server *server,
  * Create a version 4 volume record
  * - keyed on server and FSID
  */
-/*struct nfs_server *nfs4_create_server(const struct nfs_parsed_mount_data *data,
-				      struct nfs_fh *mntfh)*/
-struct nfs_server *nfs4_create_server(struct nfs_mount_info *mount_info,
-				      struct nfs_subversion *nfs_mod)
+struct nfs_server *nfs4_create_server(struct nfs_mount_context *ctx)
 {
 	struct nfs_server *server;
 	bool auth_probe;
@@ -1169,14 +1166,14 @@ struct nfs_server *nfs4_create_server(struct nfs_mount_info *mount_info,
 	if (!server)
 		return ERR_PTR(-ENOMEM);
 
-	auth_probe = mount_info->parsed->auth_info.flavor_len < 1;
+	auth_probe = ctx->auth_info.flavor_len < 1;
 
 	/* set up the general RPC client */
-	error = nfs4_init_server(server, mount_info->parsed);
+	error = nfs4_init_server(server, ctx);
 	if (error < 0)
 		goto error;
 
-	error = nfs4_server_common_setup(server, mount_info->mntfh, auth_probe);
+	error = nfs4_server_common_setup(server, ctx->mntfh, auth_probe);
 	if (error < 0)
 		goto error;
 
@@ -1192,8 +1189,7 @@ struct nfs_server *nfs4_create_server(struct nfs_mount_info *mount_info,
 /*
  * Create an NFS4 referral server record
  */
-struct nfs_server *nfs4_create_referral_server(struct nfs_clone_mount *data,
-					       struct nfs_fh *mntfh)
+struct nfs_server *nfs4_create_referral_server(struct nfs_mount_context *ctx)
 {
 	struct nfs_client *parent_client;
 	struct nfs_server *server, *parent_server;
@@ -1206,7 +1202,7 @@ struct nfs_server *nfs4_create_referral_server(struct nfs_clone_mount *data,
 	if (!server)
 		return ERR_PTR(-ENOMEM);
 
-	parent_server = NFS_SB(data->sb);
+	parent_server = NFS_SB(ctx->clone_data.sb);
 	parent_client = parent_server->nfs_client;
 
 	/* Initialise the client representation from the parent server */
@@ -1214,9 +1210,10 @@ struct nfs_server *nfs4_create_referral_server(struct nfs_clone_mount *data,
 
 	/* Get a client representation.
 	 * Note: NFSv4 always uses TCP, */
-	error = nfs4_set_client(server, data->hostname,
-				data->addr,
-				data->addrlen,
+	error = nfs4_set_client(server,
+				ctx->nfs_server.hostname,
+				(const struct sockaddr *)&ctx->nfs_server.address,
+				ctx->nfs_server.addrlen,
 				parent_client->cl_ipaddr,
 				rpc_protocol(parent_server->client),
 				parent_server->client->cl_timeout,
@@ -1225,13 +1222,14 @@ struct nfs_server *nfs4_create_referral_server(struct nfs_clone_mount *data,
 	if (error < 0)
 		goto error;
 
-	error = nfs_init_server_rpcclient(server, parent_server->client->cl_timeout, data->authflavor);
+	error = nfs_init_server_rpcclient(server, parent_server->client->cl_timeout,
+					  ctx->selected_flavor);
 	if (error < 0)
 		goto error;
 
 	auth_probe = parent_server->auth_info.flavor_len < 1;
 
-	error = nfs4_server_common_setup(server, mntfh, auth_probe);
+	error = nfs4_server_common_setup(server, ctx->mntfh, auth_probe);
 	if (error < 0)
 		goto error;
 
diff --git a/fs/nfs/nfs4namespace.c b/fs/nfs/nfs4namespace.c
index d8b040bd9814..2f1af5bdf93a 100644
--- a/fs/nfs/nfs4namespace.c
+++ b/fs/nfs/nfs4namespace.c
@@ -7,6 +7,7 @@
  * NFSv4 namespace
  */
 
+#include <linux/module.h>
 #include <linux/dcache.h>
 #include <linux/mount.h>
 #include <linux/namei.h>
@@ -20,37 +21,64 @@
 #include <linux/inet.h>
 #include "internal.h"
 #include "nfs4_fs.h"
+#include "nfs.h"
 #include "dns_resolve.h"
 
 #define NFSDBG_FACILITY		NFSDBG_VFS
 
 /*
+ * Work out the length that an NFSv4 path would render to as a standard posix
+ * path, with a leading slash but no terminating slash.
+ */
+static ssize_t nfs4_pathname_len(const struct nfs4_pathname *pathname)
+{
+	ssize_t len = 0;
+	int i;
+
+	for (i = 0; i < pathname->ncomponents; i++) {
+		const struct nfs4_string *component = &pathname->components[i];
+
+		if (component->len > NAME_MAX)
+			goto too_long;
+		len += 1 + component->len; /* Adding "/foo" */
+		if (len > PATH_MAX)
+			goto too_long;
+	}
+	return len;
+
+too_long:
+	return -ENAMETOOLONG;
+}
+
+/*
  * Convert the NFSv4 pathname components into a standard posix path.
- *
- * Note that the resulting string will be placed at the end of the buffer
  */
-static inline char *nfs4_pathname_string(const struct nfs4_pathname *pathname,
-					 char *buffer, ssize_t buflen)
+static char *nfs4_pathname_string(const struct nfs4_pathname *pathname,
+				  unsigned short *_len)
 {
-	char *end = buffer + buflen;
-	int n;
+	ssize_t len;
+	char *buf, *p;
+	int i;
 
-	*--end = '\0';
-	buflen--;
-
-	n = pathname->ncomponents;
-	while (--n >= 0) {
-		const struct nfs4_string *component = &pathname->components[n];
-		buflen -= component->len + 1;
-		if (buflen < 0)
-			goto Elong;
-		end -= component->len;
-		memcpy(end, component->data, component->len);
-		*--end = '/';
+	len = nfs4_pathname_len(pathname);
+	if (len < 0)
+		return ERR_PTR(len);
+	*_len = len;
+
+	p = buf = kmalloc(len + 1, GFP_KERNEL);
+	if (!buf)
+		return ERR_PTR(-ENOMEM);
+
+	for (i = 0; i < pathname->ncomponents; i++) {
+		const struct nfs4_string *component = &pathname->components[i];
+
+		*p++ = '/';
+		memcpy(p, component->data, component->len);
+		p += component->len;
 	}
-	return end;
-Elong:
-	return ERR_PTR(-ENAMETOOLONG);
+
+	*p = 0;
+	return buf;
 }
 
 /*
@@ -99,21 +127,25 @@ static char *nfs4_path(struct dentry *dentry, char *buffer, ssize_t buflen)
  */
 static int nfs4_validate_fspath(struct dentry *dentry,
 				const struct nfs4_fs_locations *locations,
-				char *page, char *page2)
+				struct nfs_mount_context *ctx)
 {
-	const char *path, *fs_path;
+	const char *path;
+	char *buf;
+	int n;
 
-	path = nfs4_path(dentry, page, PAGE_SIZE);
-	if (IS_ERR(path))
+	buf = kmalloc(4096, GFP_KERNEL);
+	path = buf ? nfs4_path(dentry, buf, 4096) : ERR_PTR(-ENOMEM);
+	if (IS_ERR(path)) {
+		kfree(buf);
 		return PTR_ERR(path);
+	}
 
-	fs_path = nfs4_pathname_string(&locations->fs_path, page2, PAGE_SIZE);
-	if (IS_ERR(fs_path))
-		return PTR_ERR(fs_path);
-
-	if (strncmp(path, fs_path, strlen(fs_path)) != 0) {
+	n = strncmp(path, ctx->nfs_server.export_path,
+		    ctx->nfs_server.export_path_len);
+	kfree(buf);
+	if (n != 0) {
 		dprintk("%s: path %s does not begin with fsroot %s\n",
-			__func__, path, fs_path);
+			__func__, path, ctx->nfs_server.export_path);
 		return -ENOENT;
 	}
 
@@ -234,56 +266,66 @@ nfs4_negotiate_security(struct rpc_clnt *clnt, struct inode *inode,
 	return new;
 }
 
-static struct vfsmount *try_location(struct nfs_clone_mount *mountdata,
-				     char *page, char *page2,
+static struct vfsmount *try_location(struct dentry *dentry,
+				     struct nfs_mount_context *ctx,
 				     const struct nfs4_fs_location *location)
 {
-	const size_t addr_bufsize = sizeof(struct sockaddr_storage);
-	struct net *net = rpc_net_ns(NFS_SB(mountdata->sb)->client);
+	struct net *net = rpc_net_ns(NFS_SB(dentry->d_sb)->client);
 	struct vfsmount *mnt = ERR_PTR(-ENOENT);
-	char *mnt_path;
-	unsigned int maxbuflen;
-	unsigned int s;
+	unsigned int len, s;
+	char *p;
 
-	mnt_path = nfs4_pathname_string(&location->rootpath, page2, PAGE_SIZE);
-	if (IS_ERR(mnt_path))
-		return ERR_CAST(mnt_path);
-	mountdata->mnt_path = mnt_path;
-	maxbuflen = mnt_path - 1 - page2;
+	/* Allocate a buffer big enough to hold any of the hostnames plus a
+	 * terminating char and also a buffer big enough to hold the hostname
+	 * plus a colon plus the path.
+	 */
+	len = 0;
+	for (s = 0; s < location->nservers; s++) {
+		const struct nfs4_string *buf = &location->servers[s];
+		if (buf->len > len)
+			len = buf->len;
+	}
 
-	mountdata->addr = kmalloc(addr_bufsize, GFP_KERNEL);
-	if (mountdata->addr == NULL)
+	ctx->nfs_server.hostname = kmalloc(len + 1, GFP_KERNEL);
+	if (!ctx->nfs_server.hostname)
 		return ERR_PTR(-ENOMEM);
 
+	ctx->mc.device = kmalloc(len + 1 + ctx->nfs_server.export_path_len + 1,
+				 GFP_KERNEL);
+	if (!ctx->mc.device)
+		return ERR_PTR(-ENOMEM);
+
 	for (s = 0; s < location->nservers; s++) {
 		const struct nfs4_string *buf = &location->servers[s];
 
-		if (buf->len <= 0 || buf->len >= maxbuflen)
-			continue;
-
 		if (memchr(buf->data, IPV6_SCOPE_DELIMITER, buf->len))
 			continue;
 
-		mountdata->addrlen = nfs_parse_server_name(buf->data, buf->len,
-				mountdata->addr, addr_bufsize, net);
-		if (mountdata->addrlen == 0)
+		ctx->nfs_server.addrlen =
+			nfs_parse_server_name(buf->data, buf->len,
+					      &ctx->nfs_server.address,
+					      sizeof(ctx->nfs_server._address),
+					      net);
+		if (ctx->nfs_server.addrlen == 0)
 			continue;
 
-		rpc_set_port(mountdata->addr, NFS_PORT);
+		rpc_set_port(&ctx->nfs_server.address, NFS_PORT);
 
-		memcpy(page2, buf->data, buf->len);
-		page2[buf->len] = '\0';
-		mountdata->hostname = page2;
+		memcpy(ctx->nfs_server.hostname, buf->data, buf->len);
+		ctx->nfs_server.hostname[buf->len] = '\0';
 
-		snprintf(page, PAGE_SIZE, "%s:%s",
-				mountdata->hostname,
-				mountdata->mnt_path);
+		p = ctx->mc.device;
+		memcpy(p, buf->data, buf->len);
+		p += buf->len;
+		*p++ = ':';
+		memcpy(p, ctx->nfs_server.export_path, ctx->nfs_server.export_path_len);
+		p += ctx->nfs_server.export_path_len;
+		*p = 0;
 
-		mnt = vfs_submount(mountdata->dentry, &nfs4_referral_fs_type, page, mountdata);
+		mnt = vfs_submount_mc(ctx->clone_data.dentry, &ctx->mc);
 		if (!IS_ERR(mnt))
 			break;
 	}
-	kfree(mountdata->addr);
 	return mnt;
 }
 
@@ -296,33 +338,42 @@ static struct vfsmount *try_location(struct nfs_clone_mount *mountdata,
 static struct vfsmount *nfs_follow_referral(struct dentry *dentry,
 					    const struct nfs4_fs_locations *locations)
 {
-	struct vfsmount *mnt = ERR_PTR(-ENOENT);
-	struct nfs_clone_mount mountdata = {
-		.sb = dentry->d_sb,
-		.dentry = dentry,
-		.authflavor = NFS_SB(dentry->d_sb)->client->cl_auth->au_flavor,
-	};
-	char *page = NULL, *page2 = NULL;
+	struct nfs_mount_context *ctx;
+	struct mount_context *mc;
+	struct vfsmount *mnt;
+	char *export_path;
 	int loc, error;
 
 	if (locations == NULL || locations->nlocations <= 0)
 		goto out;
 
+	mc = __vfs_fsopen(&nfs4_fs_type, dentry->d_sb, 0, 0, MOUNT_TYPE_SUBMOUNT);
+	if (IS_ERR(mc)) {
+		mnt = ERR_CAST(mc);
+		goto out;
+	}
+	ctx = container_of(mc, struct nfs_mount_context, mc);
+
 	dprintk("%s: referral at %pd2\n", __func__, dentry);
 
-	page = (char *) __get_free_page(GFP_USER);
-	if (!page)
-		goto out;
+	ctx->mount_type		= NFS4_MOUNT_REFERRAL;
+	ctx->clone_data.sb	= dentry->d_sb;
+	ctx->clone_data.dentry	= dentry;
+	ctx->clone_data.cloned	= true;
 
-	page2 = (char *) __get_free_page(GFP_USER);
-	if (!page2)
-		goto out;
+	export_path = nfs4_pathname_string(&locations->fs_path,
+					   &ctx->nfs_server.export_path_len);
+	if (IS_ERR(export_path)) {
+		mnt = ERR_CAST(export_path);
+		goto out_mc;
+	}
+	ctx->nfs_server.export_path = export_path;
 
 	/* Ensure fs path is a prefix of current dentry path */
-	error = nfs4_validate_fspath(dentry, locations, page, page2);
+	error = nfs4_validate_fspath(dentry, locations, ctx);
 	if (error < 0) {
 		mnt = ERR_PTR(error);
-		goto out;
+		goto out_mc;
 	}
 
 	for (loc = 0; loc < locations->nlocations; loc++) {
@@ -332,14 +383,14 @@ static struct vfsmount *nfs_follow_referral(struct dentry *dentry,
 		    location->rootpath.ncomponents == 0)
 			continue;
 
-		mnt = try_location(&mountdata, page, page2, location);
+		mnt = try_location(ctx->clone_data.dentry, ctx, location);
 		if (!IS_ERR(mnt))
 			break;
 	}
 
+out_mc:
+	put_mount_context(mc);
 out:
-	free_page((unsigned long) page);
-	free_page((unsigned long) page2);
 	dprintk("%s: done\n", __func__);
 	return mnt;
 }
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 201ca3f2c4ba..32d8c10bc45e 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -9318,6 +9318,7 @@ const struct nfs_rpc_ops nfs_v4_clientops = {
 	.file_inode_ops	= &nfs4_file_inode_operations,
 	.file_ops	= &nfs4_file_operations,
 	.getroot	= nfs4_proc_get_root,
+	.mount		= nfs4_mount,
 	.submount	= nfs4_submount,
 	.try_mount	= nfs4_try_mount,
 	.getattr	= nfs4_proc_getattr,
diff --git a/fs/nfs/nfs4super.c b/fs/nfs/nfs4super.c
index 6fb7cb6b3f4b..a71cea1ff0e1 100644
--- a/fs/nfs/nfs4super.c
+++ b/fs/nfs/nfs4super.c
@@ -17,36 +17,9 @@
 
 static int nfs4_write_inode(struct inode *inode, struct writeback_control *wbc);
 static void nfs4_evict_inode(struct inode *inode);
-static struct dentry *nfs4_remote_mount(struct file_system_type *fs_type,
-	int flags, const char *dev_name, void *raw_data);
-static struct dentry *nfs4_referral_mount(struct file_system_type *fs_type,
-	int flags, const char *dev_name, void *raw_data);
-static struct dentry *nfs4_remote_referral_mount(struct file_system_type *fs_type,
-	int flags, const char *dev_name, void *raw_data);
-
-static struct file_system_type nfs4_remote_fs_type = {
-	.owner		= THIS_MODULE,
-	.name		= "nfs4",
-	.mount		= nfs4_remote_mount,
-	.kill_sb	= nfs_kill_super,
-	.fs_flags	= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
-};
-
-static struct file_system_type nfs4_remote_referral_fs_type = {
-	.owner		= THIS_MODULE,
-	.name		= "nfs4",
-	.mount		= nfs4_remote_referral_mount,
-	.kill_sb	= nfs_kill_super,
-	.fs_flags	= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
-};
-
-struct file_system_type nfs4_referral_fs_type = {
-	.owner		= THIS_MODULE,
-	.name		= "nfs4",
-	.mount		= nfs4_referral_mount,
-	.kill_sb	= nfs_kill_super,
-	.fs_flags	= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
-};
+static struct dentry *nfs4_remote_mount(struct nfs_mount_context *ctx);
+static struct dentry *nfs4_referral_mount(struct nfs_mount_context *ctx);
+static struct dentry *nfs4_remote_referral_mount(struct nfs_mount_context *ctx);
 
 static const struct super_operations nfs4_sops = {
 	.alloc_inode	= nfs_alloc_inode,
@@ -60,16 +33,16 @@ static const struct super_operations nfs4_sops = {
 	.show_devname	= nfs_show_devname,
 	.show_path	= nfs_show_path,
 	.show_stats	= nfs_show_stats,
-	.remount_fs	= nfs_remount,
+	.remount_fs_mc	= nfs_remount,
 };
 
 struct nfs_subversion nfs_v4 = {
-	.owner = THIS_MODULE,
-	.nfs_fs   = &nfs4_fs_type,
-	.rpc_vers = &nfs_version4,
-	.rpc_ops  = &nfs_v4_clientops,
-	.sops     = &nfs4_sops,
-	.xattr    = nfs4_xattr_handlers,
+	.owner		= THIS_MODULE,
+	.nfs_fs		= &nfs4_fs_type,
+	.rpc_vers	= &nfs_version4,
+	.rpc_ops	= &nfs_v4_clientops,
+	.sops		= &nfs4_sops,
+	.xattr		= nfs4_xattr_handlers,
 };
 
 static int nfs4_write_inode(struct inode *inode, struct writeback_control *wbc)
@@ -103,47 +76,58 @@ static void nfs4_evict_inode(struct inode *inode)
 /*
  * Get the superblock for the NFS4 root partition
  */
-static struct dentry *
-nfs4_remote_mount(struct file_system_type *fs_type, int flags,
-		  const char *dev_name, void *info)
+static struct dentry *nfs4_remote_mount(struct nfs_mount_context *ctx)
 {
-	struct nfs_mount_info *mount_info = info;
 	struct nfs_server *server;
-	struct dentry *mntroot = ERR_PTR(-ENOMEM);
 
-	mount_info->set_security = nfs_set_sb_security;
+	ctx->set_security = nfs_set_sb_security;
 
 	/* Get a volume representation */
-	server = nfs4_create_server(mount_info, &nfs_v4);
-	if (IS_ERR(server)) {
-		mntroot = ERR_CAST(server);
-		goto out;
-	}
-
-	mntroot = nfs_fs_mount_common(server, flags, dev_name, mount_info, &nfs_v4);
+	server = nfs4_create_server(ctx);
+	if (IS_ERR(server))
+		return ERR_CAST(server);
 
-out:
-	return mntroot;
+	return nfs_fs_mount_common(server, ctx);
 }
 
-static struct vfsmount *nfs_do_root_mount(struct file_system_type *fs_type,
-		int flags, void *data, const char *hostname)
+/*
+ * Create a mount for the root of the server.  We copy the mount context we
+ * have for the parameters and set its hostname, path and type.
+ */
+static struct vfsmount *nfs_do_root_mount(struct nfs_mount_context *ctx,
+					  const char *hostname,
+					  enum nfs_mount_type type)
 {
+	struct nfs_mount_context *root_ctx;
+	struct mount_context *root_mc;
 	struct vfsmount *root_mnt;
 	char *root_devname;
 	size_t len;
 
+	root_mc = vfs_dup_mount_context(&ctx->mc);
+	if (IS_ERR(root_mc))
+		return ERR_CAST(root_mc);
+	root_ctx = container_of(root_mc, struct nfs_mount_context, mc);
+
+	root_ctx->mount_type = type;
+	root_ctx->nfs_server.export_path = (char *)nfs_slash;
+
 	len = strlen(hostname) + 5;
+	root_mnt = ERR_PTR(-ENOMEM);
 	root_devname = kmalloc(len, GFP_KERNEL);
 	if (root_devname == NULL)
-		return ERR_PTR(-ENOMEM);
+		goto out_mc;
+
 	/* Does hostname needs to be enclosed in brackets? */
 	if (strchr(hostname, ':'))
 		snprintf(root_devname, len, "[%s]:/", hostname);
 	else
 		snprintf(root_devname, len, "%s:/", hostname);
-	root_mnt = vfs_kern_mount(fs_type, flags, root_devname, data);
-	kfree(root_devname);
+	root_ctx->mc.device = root_devname;
+
+	root_mnt = vfs_kern_mount_mc(&root_ctx->mc);
+out_mc:
+	put_mount_context(root_mc);
 	return root_mnt;
 }
 
@@ -234,24 +218,24 @@ static struct dentry *nfs_follow_remote_path(struct vfsmount *root_mnt,
 	return dentry;
 }
 
-struct dentry *nfs4_try_mount(int flags, const char *dev_name,
-			      struct nfs_mount_info *mount_info,
-			      struct nfs_subversion *nfs_mod)
+struct dentry *nfs4_try_mount(struct nfs_mount_context *ctx)
 {
-	char *export_path;
 	struct vfsmount *root_mnt;
 	struct dentry *res;
-	struct nfs_parsed_mount_data *data = mount_info->parsed;
 
 	dfprintk(MOUNT, "--> nfs4_try_mount()\n");
 
-	export_path = data->nfs_server.export_path;
-	data->nfs_server.export_path = "/";
-	root_mnt = nfs_do_root_mount(&nfs4_remote_fs_type, flags, mount_info,
-			data->nfs_server.hostname);
-	data->nfs_server.export_path = export_path;
+	/* We create a mount for the server's root, walk to the requested
+	 * location and then create another mount for that.
+	 */
+	root_mnt = nfs_do_root_mount(ctx, ctx->nfs_server.hostname,
+				     NFS4_MOUNT_REMOTE);
+	if (IS_ERR(root_mnt))
+		return ERR_CAST(root_mnt);
 
-	res = nfs_follow_remote_path(root_mnt, export_path);
+	res = nfs_follow_remote_path(root_mnt, ctx->nfs_server.export_path);
+	if (IS_ERR(res))
+		ctx->mc.error = "NFS4: Couldn't follow remote path";
 
 	dfprintk(MOUNT, "<-- nfs4_try_mount() = %d%s\n",
 		 PTR_ERR_OR_ZERO(res),
@@ -259,64 +243,64 @@ struct dentry *nfs4_try_mount(int flags, const char *dev_name,
 	return res;
 }
 
-static struct dentry *
-nfs4_remote_referral_mount(struct file_system_type *fs_type, int flags,
-			   const char *dev_name, void *raw_data)
+static struct dentry *nfs4_remote_referral_mount(struct nfs_mount_context *ctx)
 {
-	struct nfs_mount_info mount_info = {
-		.fill_super = nfs_fill_super,
-		.set_security = nfs_clone_sb_security,
-		.cloned = raw_data,
-	};
 	struct nfs_server *server;
-	struct dentry *mntroot = ERR_PTR(-ENOMEM);
 
 	dprintk("--> nfs4_referral_get_sb()\n");
 
-	mount_info.mntfh = nfs_alloc_fhandle();
-	if (mount_info.cloned == NULL || mount_info.mntfh == NULL)
-		goto out;
+	ctx->set_security = nfs_clone_sb_security;
+
+	if (!ctx->clone_data.cloned)
+		return ERR_PTR(-EINVAL);
 
 	/* create a new volume representation */
-	server = nfs4_create_referral_server(mount_info.cloned, mount_info.mntfh);
-	if (IS_ERR(server)) {
-		mntroot = ERR_CAST(server);
-		goto out;
-	}
+	server = nfs4_create_referral_server(ctx);
+	if (IS_ERR(server))
+		return ERR_CAST(server);
 
-	mntroot = nfs_fs_mount_common(server, flags, dev_name, &mount_info, &nfs_v4);
-out:
-	nfs_free_fhandle(mount_info.mntfh);
-	return mntroot;
+	return nfs_fs_mount_common(server, ctx);
 }
 
 /*
  * Create an NFS4 server record on referral traversal
  */
-static struct dentry *nfs4_referral_mount(struct file_system_type *fs_type,
-		int flags, const char *dev_name, void *raw_data)
+static struct dentry *nfs4_referral_mount(struct nfs_mount_context *ctx)
 {
-	struct nfs_clone_mount *data = raw_data;
-	char *export_path;
 	struct vfsmount *root_mnt;
 	struct dentry *res;
 
 	dprintk("--> nfs4_referral_mount()\n");
 
-	export_path = data->mnt_path;
-	data->mnt_path = "/";
-
-	root_mnt = nfs_do_root_mount(&nfs4_remote_referral_fs_type,
-			flags, data, data->hostname);
-	data->mnt_path = export_path;
+	root_mnt = nfs_do_root_mount(ctx, ctx->nfs_server.hostname,
+				     NFS4_MOUNT_REMOTE_REFERRAL);
 
-	res = nfs_follow_remote_path(root_mnt, export_path);
+	res = nfs_follow_remote_path(root_mnt, ctx->nfs_server.export_path);
 	dprintk("<-- nfs4_referral_mount() = %d%s\n",
 		PTR_ERR_OR_ZERO(res),
 		IS_ERR(res) ? " [error]" : "");
 	return res;
 }
 
+/*
+ * Handle special NFS4 mount types.
+ */
+struct dentry *nfs4_mount(struct nfs_mount_context *ctx)
+{
+	switch (ctx->mount_type) {
+	case NFS4_MOUNT_REMOTE:
+		return nfs4_remote_mount(ctx);
+
+	case NFS4_MOUNT_REFERRAL:
+		return nfs4_referral_mount(ctx);
+
+	case NFS4_MOUNT_REMOTE_REFERRAL:
+		return nfs4_remote_referral_mount(ctx);
+
+	default:
+		return nfs_general_mount(ctx);
+	}
+}
 
 static int __init init_nfs_v4(void)
 {
diff --git a/fs/nfs/proc.c b/fs/nfs/proc.c
index b7bca8303989..edae9cd50412 100644
--- a/fs/nfs/proc.c
+++ b/fs/nfs/proc.c
@@ -704,6 +704,7 @@ const struct nfs_rpc_ops nfs_v2_clientops = {
 	.file_inode_ops	= &nfs_file_inode_operations,
 	.file_ops	= &nfs_file_operations,
 	.getroot	= nfs_proc_get_root,
+	.mount		= nfs_general_mount,
 	.submount	= nfs_submount,
 	.try_mount	= nfs_try_mount,
 	.getattr	= nfs_proc_getattr,
diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index dc69314d455e..82d4a3071517 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -68,244 +68,6 @@
 #include "nfs.h"
 
 #define NFSDBG_FACILITY		NFSDBG_VFS
-#define NFS_TEXT_DATA		1
-
-#if IS_ENABLED(CONFIG_NFS_V3)
-#define NFS_DEFAULT_VERSION 3
-#else
-#define NFS_DEFAULT_VERSION 2
-#endif
-
-enum {
-	/* Mount options that take no arguments */
-	Opt_soft, Opt_hard,
-	Opt_posix, Opt_noposix,
-	Opt_cto, Opt_nocto,
-	Opt_ac, Opt_noac,
-	Opt_lock, Opt_nolock,
-	Opt_udp, Opt_tcp, Opt_rdma,
-	Opt_acl, Opt_noacl,
-	Opt_rdirplus, Opt_nordirplus,
-	Opt_sharecache, Opt_nosharecache,
-	Opt_resvport, Opt_noresvport,
-	Opt_fscache, Opt_nofscache,
-	Opt_migration, Opt_nomigration,
-
-	/* Mount options that take integer arguments */
-	Opt_port,
-	Opt_rsize, Opt_wsize, Opt_bsize,
-	Opt_timeo, Opt_retrans,
-	Opt_acregmin, Opt_acregmax,
-	Opt_acdirmin, Opt_acdirmax,
-	Opt_actimeo,
-	Opt_namelen,
-	Opt_mountport,
-	Opt_mountvers,
-	Opt_minorversion,
-
-	/* Mount options that take string arguments */
-	Opt_nfsvers,
-	Opt_sec, Opt_proto, Opt_mountproto, Opt_mounthost,
-	Opt_addr, Opt_mountaddr, Opt_clientaddr,
-	Opt_lookupcache,
-	Opt_fscache_uniq,
-	Opt_local_lock,
-
-	/* Special mount options */
-	Opt_userspace, Opt_deprecated, Opt_sloppy,
-
-	Opt_err
-};
-
-static const match_table_t nfs_mount_option_tokens = {
-	{ Opt_userspace, "bg" },
-	{ Opt_userspace, "fg" },
-	{ Opt_userspace, "retry=%s" },
-
-	{ Opt_sloppy, "sloppy" },
-
-	{ Opt_soft, "soft" },
-	{ Opt_hard, "hard" },
-	{ Opt_deprecated, "intr" },
-	{ Opt_deprecated, "nointr" },
-	{ Opt_posix, "posix" },
-	{ Opt_noposix, "noposix" },
-	{ Opt_cto, "cto" },
-	{ Opt_nocto, "nocto" },
-	{ Opt_ac, "ac" },
-	{ Opt_noac, "noac" },
-	{ Opt_lock, "lock" },
-	{ Opt_nolock, "nolock" },
-	{ Opt_udp, "udp" },
-	{ Opt_tcp, "tcp" },
-	{ Opt_rdma, "rdma" },
-	{ Opt_acl, "acl" },
-	{ Opt_noacl, "noacl" },
-	{ Opt_rdirplus, "rdirplus" },
-	{ Opt_nordirplus, "nordirplus" },
-	{ Opt_sharecache, "sharecache" },
-	{ Opt_nosharecache, "nosharecache" },
-	{ Opt_resvport, "resvport" },
-	{ Opt_noresvport, "noresvport" },
-	{ Opt_fscache, "fsc" },
-	{ Opt_nofscache, "nofsc" },
-	{ Opt_migration, "migration" },
-	{ Opt_nomigration, "nomigration" },
-
-	{ Opt_port, "port=%s" },
-	{ Opt_rsize, "rsize=%s" },
-	{ Opt_wsize, "wsize=%s" },
-	{ Opt_bsize, "bsize=%s" },
-	{ Opt_timeo, "timeo=%s" },
-	{ Opt_retrans, "retrans=%s" },
-	{ Opt_acregmin, "acregmin=%s" },
-	{ Opt_acregmax, "acregmax=%s" },
-	{ Opt_acdirmin, "acdirmin=%s" },
-	{ Opt_acdirmax, "acdirmax=%s" },
-	{ Opt_actimeo, "actimeo=%s" },
-	{ Opt_namelen, "namlen=%s" },
-	{ Opt_mountport, "mountport=%s" },
-	{ Opt_mountvers, "mountvers=%s" },
-	{ Opt_minorversion, "minorversion=%s" },
-
-	{ Opt_nfsvers, "nfsvers=%s" },
-	{ Opt_nfsvers, "vers=%s" },
-
-	{ Opt_sec, "sec=%s" },
-	{ Opt_proto, "proto=%s" },
-	{ Opt_mountproto, "mountproto=%s" },
-	{ Opt_addr, "addr=%s" },
-	{ Opt_clientaddr, "clientaddr=%s" },
-	{ Opt_mounthost, "mounthost=%s" },
-	{ Opt_mountaddr, "mountaddr=%s" },
-
-	{ Opt_lookupcache, "lookupcache=%s" },
-	{ Opt_fscache_uniq, "fsc=%s" },
-	{ Opt_local_lock, "local_lock=%s" },
-
-	/* The following needs to be listed after all other options */
-	{ Opt_nfsvers, "v%s" },
-
-	{ Opt_err, NULL }
-};
-
-enum {
-	Opt_xprt_udp, Opt_xprt_udp6, Opt_xprt_tcp, Opt_xprt_tcp6, Opt_xprt_rdma,
-	Opt_xprt_rdma6,
-
-	Opt_xprt_err
-};
-
-static const match_table_t nfs_xprt_protocol_tokens = {
-	{ Opt_xprt_udp, "udp" },
-	{ Opt_xprt_udp6, "udp6" },
-	{ Opt_xprt_tcp, "tcp" },
-	{ Opt_xprt_tcp6, "tcp6" },
-	{ Opt_xprt_rdma, "rdma" },
-	{ Opt_xprt_rdma6, "rdma6" },
-
-	{ Opt_xprt_err, NULL }
-};
-
-enum {
-	Opt_sec_none, Opt_sec_sys,
-	Opt_sec_krb5, Opt_sec_krb5i, Opt_sec_krb5p,
-	Opt_sec_lkey, Opt_sec_lkeyi, Opt_sec_lkeyp,
-	Opt_sec_spkm, Opt_sec_spkmi, Opt_sec_spkmp,
-
-	Opt_sec_err
-};
-
-static const match_table_t nfs_secflavor_tokens = {
-	{ Opt_sec_none, "none" },
-	{ Opt_sec_none, "null" },
-	{ Opt_sec_sys, "sys" },
-
-	{ Opt_sec_krb5, "krb5" },
-	{ Opt_sec_krb5i, "krb5i" },
-	{ Opt_sec_krb5p, "krb5p" },
-
-	{ Opt_sec_lkey, "lkey" },
-	{ Opt_sec_lkeyi, "lkeyi" },
-	{ Opt_sec_lkeyp, "lkeyp" },
-
-	{ Opt_sec_spkm, "spkm3" },
-	{ Opt_sec_spkmi, "spkm3i" },
-	{ Opt_sec_spkmp, "spkm3p" },
-
-	{ Opt_sec_err, NULL }
-};
-
-enum {
-	Opt_lookupcache_all, Opt_lookupcache_positive,
-	Opt_lookupcache_none,
-
-	Opt_lookupcache_err
-};
-
-static match_table_t nfs_lookupcache_tokens = {
-	{ Opt_lookupcache_all, "all" },
-	{ Opt_lookupcache_positive, "pos" },
-	{ Opt_lookupcache_positive, "positive" },
-	{ Opt_lookupcache_none, "none" },
-
-	{ Opt_lookupcache_err, NULL }
-};
-
-enum {
-	Opt_local_lock_all, Opt_local_lock_flock, Opt_local_lock_posix,
-	Opt_local_lock_none,
-
-	Opt_local_lock_err
-};
-
-static match_table_t nfs_local_lock_tokens = {
-	{ Opt_local_lock_all, "all" },
-	{ Opt_local_lock_flock, "flock" },
-	{ Opt_local_lock_posix, "posix" },
-	{ Opt_local_lock_none, "none" },
-
-	{ Opt_local_lock_err, NULL }
-};
-
-enum {
-	Opt_vers_2, Opt_vers_3, Opt_vers_4, Opt_vers_4_0,
-	Opt_vers_4_1, Opt_vers_4_2,
-
-	Opt_vers_err
-};
-
-static match_table_t nfs_vers_tokens = {
-	{ Opt_vers_2, "2" },
-	{ Opt_vers_3, "3" },
-	{ Opt_vers_4, "4" },
-	{ Opt_vers_4_0, "4.0" },
-	{ Opt_vers_4_1, "4.1" },
-	{ Opt_vers_4_2, "4.2" },
-
-	{ Opt_vers_err, NULL }
-};
-
-static struct dentry *nfs_xdev_mount(struct file_system_type *fs_type,
-		int flags, const char *dev_name, void *raw_data);
-
-struct file_system_type nfs_fs_type = {
-	.owner		= THIS_MODULE,
-	.name		= "nfs",
-	.mount		= nfs_fs_mount,
-	.kill_sb	= nfs_kill_super,
-	.fs_flags	= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
-};
-MODULE_ALIAS_FS("nfs");
-EXPORT_SYMBOL_GPL(nfs_fs_type);
-
-struct file_system_type nfs_xdev_fs_type = {
-	.owner		= THIS_MODULE,
-	.name		= "nfs",
-	.mount		= nfs_xdev_mount,
-	.kill_sb	= nfs_kill_super,
-	.fs_flags	= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
-};
 
 const struct super_operations nfs_sops = {
 	.alloc_inode	= nfs_alloc_inode,
@@ -319,26 +81,11 @@ const struct super_operations nfs_sops = {
 	.show_devname	= nfs_show_devname,
 	.show_path	= nfs_show_path,
 	.show_stats	= nfs_show_stats,
-	.remount_fs	= nfs_remount,
+	.remount_fs_mc	= nfs_remount,
 };
 EXPORT_SYMBOL_GPL(nfs_sops);
 
 #if IS_ENABLED(CONFIG_NFS_V4)
-static void nfs4_validate_mount_flags(struct nfs_parsed_mount_data *);
-static int nfs4_validate_mount_data(void *options,
-	struct nfs_parsed_mount_data *args, const char *dev_name);
-
-struct file_system_type nfs4_fs_type = {
-	.owner		= THIS_MODULE,
-	.name		= "nfs4",
-	.mount		= nfs_fs_mount,
-	.kill_sb	= nfs_kill_super,
-	.fs_flags	= FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
-};
-MODULE_ALIAS_FS("nfs4");
-MODULE_ALIAS("nfs4");
-EXPORT_SYMBOL_GPL(nfs4_fs_type);
-
 static int __init register_nfs4_fs(void)
 {
 	return register_filesystem(&nfs4_fs_type);
@@ -910,141 +657,6 @@ void nfs_umount_begin(struct super_block *sb)
 }
 EXPORT_SYMBOL_GPL(nfs_umount_begin);
 
-static struct nfs_parsed_mount_data *nfs_alloc_parsed_mount_data(void)
-{
-	struct nfs_parsed_mount_data *data;
-
-	data = kzalloc(sizeof(*data), GFP_KERNEL);
-	if (data) {
-		data->timeo		= NFS_UNSPEC_TIMEO;
-		data->retrans		= NFS_UNSPEC_RETRANS;
-		data->acregmin		= NFS_DEF_ACREGMIN;
-		data->acregmax		= NFS_DEF_ACREGMAX;
-		data->acdirmin		= NFS_DEF_ACDIRMIN;
-		data->acdirmax		= NFS_DEF_ACDIRMAX;
-		data->mount_server.port	= NFS_UNSPEC_PORT;
-		data->nfs_server.port	= NFS_UNSPEC_PORT;
-		data->nfs_server.protocol = XPRT_TRANSPORT_TCP;
-		data->selected_flavor	= RPC_AUTH_MAXFLAVOR;
-		data->minorversion	= 0;
-		data->need_mount	= true;
-		data->net		= current->nsproxy->net_ns;
-		security_init_mnt_opts(&data->lsm_opts);
-	}
-	return data;
-}
-
-static void nfs_free_parsed_mount_data(struct nfs_parsed_mount_data *data)
-{
-	if (data) {
-		kfree(data->client_address);
-		kfree(data->mount_server.hostname);
-		kfree(data->nfs_server.export_path);
-		kfree(data->nfs_server.hostname);
-		kfree(data->fscache_uniq);
-		security_free_mnt_opts(&data->lsm_opts);
-		kfree(data);
-	}
-}
-
-/*
- * Sanity-check a server address provided by the mount command.
- *
- * Address family must be initialized, and address must not be
- * the ANY address for that family.
- */
-static int nfs_verify_server_address(struct sockaddr *addr)
-{
-	switch (addr->sa_family) {
-	case AF_INET: {
-		struct sockaddr_in *sa = (struct sockaddr_in *)addr;
-		return sa->sin_addr.s_addr != htonl(INADDR_ANY);
-	}
-	case AF_INET6: {
-		struct in6_addr *sa = &((struct sockaddr_in6 *)addr)->sin6_addr;
-		return !ipv6_addr_any(sa);
-	}
-	}
-
-	dfprintk(MOUNT, "NFS: Invalid IP address specified\n");
-	return 0;
-}
-
-/*
- * Select between a default port value and a user-specified port value.
- * If a zero value is set, then autobind will be used.
- */
-static void nfs_set_port(struct sockaddr *sap, int *port,
-				 const unsigned short default_port)
-{
-	if (*port == NFS_UNSPEC_PORT)
-		*port = default_port;
-
-	rpc_set_port(sap, *port);
-}
-
-/*
- * Sanity check the NFS transport protocol.
- *
- */
-static void nfs_validate_transport_protocol(struct nfs_parsed_mount_data *mnt)
-{
-	switch (mnt->nfs_server.protocol) {
-	case XPRT_TRANSPORT_UDP:
-	case XPRT_TRANSPORT_TCP:
-	case XPRT_TRANSPORT_RDMA:
-		break;
-	default:
-		mnt->nfs_server.protocol = XPRT_TRANSPORT_TCP;
-	}
-}
-
-/*
- * For text based NFSv2/v3 mounts, the mount protocol transport default
- * settings should depend upon the specified NFS transport.
- */
-static void nfs_set_mount_transport_protocol(struct nfs_parsed_mount_data *mnt)
-{
-	nfs_validate_transport_protocol(mnt);
-
-	if (mnt->mount_server.protocol == XPRT_TRANSPORT_UDP ||
-	    mnt->mount_server.protocol == XPRT_TRANSPORT_TCP)
-			return;
-	switch (mnt->nfs_server.protocol) {
-	case XPRT_TRANSPORT_UDP:
-		mnt->mount_server.protocol = XPRT_TRANSPORT_UDP;
-		break;
-	case XPRT_TRANSPORT_TCP:
-	case XPRT_TRANSPORT_RDMA:
-		mnt->mount_server.protocol = XPRT_TRANSPORT_TCP;
-	}
-}
-
-/*
- * Add 'flavor' to 'auth_info' if not already present.
- * Returns true if 'flavor' ends up in the list, false otherwise
- */
-static bool nfs_auth_info_add(struct nfs_auth_info *auth_info,
-			      rpc_authflavor_t flavor)
-{
-	unsigned int i;
-	unsigned int max_flavor_len = ARRAY_SIZE(auth_info->flavors);
-
-	/* make sure this flavor isn't already in the list */
-	for (i = 0; i < auth_info->flavor_len; i++) {
-		if (flavor == auth_info->flavors[i])
-			return true;
-	}
-
-	if (auth_info->flavor_len + 1 >= max_flavor_len) {
-		dfprintk(MOUNT, "NFS: too many sec= flavors\n");
-		return false;
-	}
-
-	auth_info->flavors[auth_info->flavor_len++] = flavor;
-	return true;
-}
-
 /*
  * Return true if 'match' is in auth_info or auth_info is empty.
  * Return false otherwise.
@@ -1066,628 +678,11 @@ bool nfs_auth_info_match(const struct nfs_auth_info *auth_info,
 EXPORT_SYMBOL_GPL(nfs_auth_info_match);
 
 /*
- * Parse the value of the 'sec=' option.
- */
-static int nfs_parse_security_flavors(char *value,
-				      struct nfs_parsed_mount_data *mnt)
-{
-	substring_t args[MAX_OPT_ARGS];
-	rpc_authflavor_t pseudoflavor;
-	char *p;
-
-	dfprintk(MOUNT, "NFS: parsing sec=%s option\n", value);
-
-	while ((p = strsep(&value, ":")) != NULL) {
-		switch (match_token(p, nfs_secflavor_tokens, args)) {
-		case Opt_sec_none:
-			pseudoflavor = RPC_AUTH_NULL;
-			break;
-		case Opt_sec_sys:
-			pseudoflavor = RPC_AUTH_UNIX;
-			break;
-		case Opt_sec_krb5:
-			pseudoflavor = RPC_AUTH_GSS_KRB5;
-			break;
-		case Opt_sec_krb5i:
-			pseudoflavor = RPC_AUTH_GSS_KRB5I;
-			break;
-		case Opt_sec_krb5p:
-			pseudoflavor = RPC_AUTH_GSS_KRB5P;
-			break;
-		case Opt_sec_lkey:
-			pseudoflavor = RPC_AUTH_GSS_LKEY;
-			break;
-		case Opt_sec_lkeyi:
-			pseudoflavor = RPC_AUTH_GSS_LKEYI;
-			break;
-		case Opt_sec_lkeyp:
-			pseudoflavor = RPC_AUTH_GSS_LKEYP;
-			break;
-		case Opt_sec_spkm:
-			pseudoflavor = RPC_AUTH_GSS_SPKM;
-			break;
-		case Opt_sec_spkmi:
-			pseudoflavor = RPC_AUTH_GSS_SPKMI;
-			break;
-		case Opt_sec_spkmp:
-			pseudoflavor = RPC_AUTH_GSS_SPKMP;
-			break;
-		default:
-			dfprintk(MOUNT,
-				 "NFS: sec= option '%s' not recognized\n", p);
-			return 0;
-		}
-
-		if (!nfs_auth_info_add(&mnt->auth_info, pseudoflavor))
-			return 0;
-	}
-
-	return 1;
-}
-
-static int nfs_parse_version_string(char *string,
-		struct nfs_parsed_mount_data *mnt,
-		substring_t *args)
-{
-	mnt->flags &= ~NFS_MOUNT_VER3;
-	switch (match_token(string, nfs_vers_tokens, args)) {
-	case Opt_vers_2:
-		mnt->version = 2;
-		break;
-	case Opt_vers_3:
-		mnt->flags |= NFS_MOUNT_VER3;
-		mnt->version = 3;
-		break;
-	case Opt_vers_4:
-		/* Backward compatibility option. In future,
-		 * the mount program should always supply
-		 * a NFSv4 minor version number.
-		 */
-		mnt->version = 4;
-		break;
-	case Opt_vers_4_0:
-		mnt->version = 4;
-		mnt->minorversion = 0;
-		break;
-	case Opt_vers_4_1:
-		mnt->version = 4;
-		mnt->minorversion = 1;
-		break;
-	case Opt_vers_4_2:
-		mnt->version = 4;
-		mnt->minorversion = 2;
-		break;
-	default:
-		return 0;
-	}
-	return 1;
-}
-
-static int nfs_get_option_str(substring_t args[], char **option)
-{
-	kfree(*option);
-	*option = match_strdup(args);
-	return !*option;
-}
-
-static int nfs_get_option_ul(substring_t args[], unsigned long *option)
-{
-	int rc;
-	char *string;
-
-	string = match_strdup(args);
-	if (string == NULL)
-		return -ENOMEM;
-	rc = kstrtoul(string, 10, option);
-	kfree(string);
-
-	return rc;
-}
-
-static int nfs_get_option_ul_bound(substring_t args[], unsigned long *option,
-		unsigned long l_bound, unsigned long u_bound)
-{
-	int ret;
-
-	ret = nfs_get_option_ul(args, option);
-	if (ret != 0)
-		return ret;
-	if (*option < l_bound || *option > u_bound)
-		return -ERANGE;
-	return 0;
-}
-
-/*
- * Error-check and convert a string of mount options from user space into
- * a data structure.  The whole mount string is processed; bad options are
- * skipped as they are encountered.  If there were no errors, return 1;
- * otherwise return 0 (zero).
- */
-static int nfs_parse_mount_options(char *raw,
-				   struct nfs_parsed_mount_data *mnt)
-{
-	char *p, *string, *secdata;
-	int rc, sloppy = 0, invalid_option = 0;
-	unsigned short protofamily = AF_UNSPEC;
-	unsigned short mountfamily = AF_UNSPEC;
-
-	if (!raw) {
-		dfprintk(MOUNT, "NFS: mount options string was NULL.\n");
-		return 1;
-	}
-	dfprintk(MOUNT, "NFS: nfs mount opts='%s'\n", raw);
-
-	secdata = alloc_secdata();
-	if (!secdata)
-		goto out_nomem;
-
-	rc = security_sb_copy_data(raw, secdata);
-	if (rc)
-		goto out_security_failure;
-
-	rc = security_sb_parse_opts_str(secdata, &mnt->lsm_opts);
-	if (rc)
-		goto out_security_failure;
-
-	free_secdata(secdata);
-
-	while ((p = strsep(&raw, ",")) != NULL) {
-		substring_t args[MAX_OPT_ARGS];
-		unsigned long option;
-		int token;
-
-		if (!*p)
-			continue;
-
-		dfprintk(MOUNT, "NFS:   parsing nfs mount option '%s'\n", p);
-
-		token = match_token(p, nfs_mount_option_tokens, args);
-		switch (token) {
-
-		/*
-		 * boolean options:  foo/nofoo
-		 */
-		case Opt_soft:
-			mnt->flags |= NFS_MOUNT_SOFT;
-			break;
-		case Opt_hard:
-			mnt->flags &= ~NFS_MOUNT_SOFT;
-			break;
-		case Opt_posix:
-			mnt->flags |= NFS_MOUNT_POSIX;
-			break;
-		case Opt_noposix:
-			mnt->flags &= ~NFS_MOUNT_POSIX;
-			break;
-		case Opt_cto:
-			mnt->flags &= ~NFS_MOUNT_NOCTO;
-			break;
-		case Opt_nocto:
-			mnt->flags |= NFS_MOUNT_NOCTO;
-			break;
-		case Opt_ac:
-			mnt->flags &= ~NFS_MOUNT_NOAC;
-			break;
-		case Opt_noac:
-			mnt->flags |= NFS_MOUNT_NOAC;
-			break;
-		case Opt_lock:
-			mnt->flags &= ~NFS_MOUNT_NONLM;
-			mnt->flags &= ~(NFS_MOUNT_LOCAL_FLOCK |
-					NFS_MOUNT_LOCAL_FCNTL);
-			break;
-		case Opt_nolock:
-			mnt->flags |= NFS_MOUNT_NONLM;
-			mnt->flags |= (NFS_MOUNT_LOCAL_FLOCK |
-				       NFS_MOUNT_LOCAL_FCNTL);
-			break;
-		case Opt_udp:
-			mnt->flags &= ~NFS_MOUNT_TCP;
-			mnt->nfs_server.protocol = XPRT_TRANSPORT_UDP;
-			break;
-		case Opt_tcp:
-			mnt->flags |= NFS_MOUNT_TCP;
-			mnt->nfs_server.protocol = XPRT_TRANSPORT_TCP;
-			break;
-		case Opt_rdma:
-			mnt->flags |= NFS_MOUNT_TCP; /* for side protocols */
-			mnt->nfs_server.protocol = XPRT_TRANSPORT_RDMA;
-			xprt_load_transport(p);
-			break;
-		case Opt_acl:
-			mnt->flags &= ~NFS_MOUNT_NOACL;
-			break;
-		case Opt_noacl:
-			mnt->flags |= NFS_MOUNT_NOACL;
-			break;
-		case Opt_rdirplus:
-			mnt->flags &= ~NFS_MOUNT_NORDIRPLUS;
-			break;
-		case Opt_nordirplus:
-			mnt->flags |= NFS_MOUNT_NORDIRPLUS;
-			break;
-		case Opt_sharecache:
-			mnt->flags &= ~NFS_MOUNT_UNSHARED;
-			break;
-		case Opt_nosharecache:
-			mnt->flags |= NFS_MOUNT_UNSHARED;
-			break;
-		case Opt_resvport:
-			mnt->flags &= ~NFS_MOUNT_NORESVPORT;
-			break;
-		case Opt_noresvport:
-			mnt->flags |= NFS_MOUNT_NORESVPORT;
-			break;
-		case Opt_fscache:
-			mnt->options |= NFS_OPTION_FSCACHE;
-			kfree(mnt->fscache_uniq);
-			mnt->fscache_uniq = NULL;
-			break;
-		case Opt_nofscache:
-			mnt->options &= ~NFS_OPTION_FSCACHE;
-			kfree(mnt->fscache_uniq);
-			mnt->fscache_uniq = NULL;
-			break;
-		case Opt_migration:
-			mnt->options |= NFS_OPTION_MIGRATION;
-			break;
-		case Opt_nomigration:
-			mnt->options &= NFS_OPTION_MIGRATION;
-			break;
-
-		/*
-		 * options that take numeric values
-		 */
-		case Opt_port:
-			if (nfs_get_option_ul(args, &option) ||
-			    option > USHRT_MAX)
-				goto out_invalid_value;
-			mnt->nfs_server.port = option;
-			break;
-		case Opt_rsize:
-			if (nfs_get_option_ul(args, &option))
-				goto out_invalid_value;
-			mnt->rsize = option;
-			break;
-		case Opt_wsize:
-			if (nfs_get_option_ul(args, &option))
-				goto out_invalid_value;
-			mnt->wsize = option;
-			break;
-		case Opt_bsize:
-			if (nfs_get_option_ul(args, &option))
-				goto out_invalid_value;
-			mnt->bsize = option;
-			break;
-		case Opt_timeo:
-			if (nfs_get_option_ul_bound(args, &option, 1, INT_MAX))
-				goto out_invalid_value;
-			mnt->timeo = option;
-			break;
-		case Opt_retrans:
-			if (nfs_get_option_ul_bound(args, &option, 0, INT_MAX))
-				goto out_invalid_value;
-			mnt->retrans = option;
-			break;
-		case Opt_acregmin:
-			if (nfs_get_option_ul(args, &option))
-				goto out_invalid_value;
-			mnt->acregmin = option;
-			break;
-		case Opt_acregmax:
-			if (nfs_get_option_ul(args, &option))
-				goto out_invalid_value;
-			mnt->acregmax = option;
-			break;
-		case Opt_acdirmin:
-			if (nfs_get_option_ul(args, &option))
-				goto out_invalid_value;
-			mnt->acdirmin = option;
-			break;
-		case Opt_acdirmax:
-			if (nfs_get_option_ul(args, &option))
-				goto out_invalid_value;
-			mnt->acdirmax = option;
-			break;
-		case Opt_actimeo:
-			if (nfs_get_option_ul(args, &option))
-				goto out_invalid_value;
-			mnt->acregmin = mnt->acregmax =
-			mnt->acdirmin = mnt->acdirmax = option;
-			break;
-		case Opt_namelen:
-			if (nfs_get_option_ul(args, &option))
-				goto out_invalid_value;
-			mnt->namlen = option;
-			break;
-		case Opt_mountport:
-			if (nfs_get_option_ul(args, &option) ||
-			    option > USHRT_MAX)
-				goto out_invalid_value;
-			mnt->mount_server.port = option;
-			break;
-		case Opt_mountvers:
-			if (nfs_get_option_ul(args, &option) ||
-			    option < NFS_MNT_VERSION ||
-			    option > NFS_MNT3_VERSION)
-				goto out_invalid_value;
-			mnt->mount_server.version = option;
-			break;
-		case Opt_minorversion:
-			if (nfs_get_option_ul(args, &option))
-				goto out_invalid_value;
-			if (option > NFS4_MAX_MINOR_VERSION)
-				goto out_invalid_value;
-			mnt->minorversion = option;
-			break;
-
-		/*
-		 * options that take text values
-		 */
-		case Opt_nfsvers:
-			string = match_strdup(args);
-			if (string == NULL)
-				goto out_nomem;
-			rc = nfs_parse_version_string(string, mnt, args);
-			kfree(string);
-			if (!rc)
-				goto out_invalid_value;
-			break;
-		case Opt_sec:
-			string = match_strdup(args);
-			if (string == NULL)
-				goto out_nomem;
-			rc = nfs_parse_security_flavors(string, mnt);
-			kfree(string);
-			if (!rc) {
-				dfprintk(MOUNT, "NFS:   unrecognized "
-						"security flavor\n");
-				return 0;
-			}
-			break;
-		case Opt_proto:
-			string = match_strdup(args);
-			if (string == NULL)
-				goto out_nomem;
-			token = match_token(string,
-					    nfs_xprt_protocol_tokens, args);
-
-			protofamily = AF_INET;
-			switch (token) {
-			case Opt_xprt_udp6:
-				protofamily = AF_INET6;
-			case Opt_xprt_udp:
-				mnt->flags &= ~NFS_MOUNT_TCP;
-				mnt->nfs_server.protocol = XPRT_TRANSPORT_UDP;
-				break;
-			case Opt_xprt_tcp6:
-				protofamily = AF_INET6;
-			case Opt_xprt_tcp:
-				mnt->flags |= NFS_MOUNT_TCP;
-				mnt->nfs_server.protocol = XPRT_TRANSPORT_TCP;
-				break;
-			case Opt_xprt_rdma6:
-				protofamily = AF_INET6;
-			case Opt_xprt_rdma:
-				/* vector side protocols to TCP */
-				mnt->flags |= NFS_MOUNT_TCP;
-				mnt->nfs_server.protocol = XPRT_TRANSPORT_RDMA;
-				xprt_load_transport(string);
-				break;
-			default:
-				dfprintk(MOUNT, "NFS:   unrecognized "
-						"transport protocol\n");
-				kfree(string);
-				return 0;
-			}
-			kfree(string);
-			break;
-		case Opt_mountproto:
-			string = match_strdup(args);
-			if (string == NULL)
-				goto out_nomem;
-			token = match_token(string,
-					    nfs_xprt_protocol_tokens, args);
-			kfree(string);
-
-			mountfamily = AF_INET;
-			switch (token) {
-			case Opt_xprt_udp6:
-				mountfamily = AF_INET6;
-			case Opt_xprt_udp:
-				mnt->mount_server.protocol = XPRT_TRANSPORT_UDP;
-				break;
-			case Opt_xprt_tcp6:
-				mountfamily = AF_INET6;
-			case Opt_xprt_tcp:
-				mnt->mount_server.protocol = XPRT_TRANSPORT_TCP;
-				break;
-			case Opt_xprt_rdma: /* not used for side protocols */
-			default:
-				dfprintk(MOUNT, "NFS:   unrecognized "
-						"transport protocol\n");
-				return 0;
-			}
-			break;
-		case Opt_addr:
-			string = match_strdup(args);
-			if (string == NULL)
-				goto out_nomem;
-			mnt->nfs_server.addrlen =
-				rpc_pton(mnt->net, string, strlen(string),
-					(struct sockaddr *)
-					&mnt->nfs_server.address,
-					sizeof(mnt->nfs_server.address));
-			kfree(string);
-			if (mnt->nfs_server.addrlen == 0)
-				goto out_invalid_address;
-			break;
-		case Opt_clientaddr:
-			if (nfs_get_option_str(args, &mnt->client_address))
-				goto out_nomem;
-			break;
-		case Opt_mounthost:
-			if (nfs_get_option_str(args,
-					       &mnt->mount_server.hostname))
-				goto out_nomem;
-			break;
-		case Opt_mountaddr:
-			string = match_strdup(args);
-			if (string == NULL)
-				goto out_nomem;
-			mnt->mount_server.addrlen =
-				rpc_pton(mnt->net, string, strlen(string),
-					(struct sockaddr *)
-					&mnt->mount_server.address,
-					sizeof(mnt->mount_server.address));
-			kfree(string);
-			if (mnt->mount_server.addrlen == 0)
-				goto out_invalid_address;
-			break;
-		case Opt_lookupcache:
-			string = match_strdup(args);
-			if (string == NULL)
-				goto out_nomem;
-			token = match_token(string,
-					nfs_lookupcache_tokens, args);
-			kfree(string);
-			switch (token) {
-				case Opt_lookupcache_all:
-					mnt->flags &= ~(NFS_MOUNT_LOOKUP_CACHE_NONEG|NFS_MOUNT_LOOKUP_CACHE_NONE);
-					break;
-				case Opt_lookupcache_positive:
-					mnt->flags &= ~NFS_MOUNT_LOOKUP_CACHE_NONE;
-					mnt->flags |= NFS_MOUNT_LOOKUP_CACHE_NONEG;
-					break;
-				case Opt_lookupcache_none:
-					mnt->flags |= NFS_MOUNT_LOOKUP_CACHE_NONEG|NFS_MOUNT_LOOKUP_CACHE_NONE;
-					break;
-				default:
-					dfprintk(MOUNT, "NFS:   invalid "
-							"lookupcache argument\n");
-					return 0;
-			};
-			break;
-		case Opt_fscache_uniq:
-			if (nfs_get_option_str(args, &mnt->fscache_uniq))
-				goto out_nomem;
-			mnt->options |= NFS_OPTION_FSCACHE;
-			break;
-		case Opt_local_lock:
-			string = match_strdup(args);
-			if (string == NULL)
-				goto out_nomem;
-			token = match_token(string, nfs_local_lock_tokens,
-					args);
-			kfree(string);
-			switch (token) {
-			case Opt_local_lock_all:
-				mnt->flags |= (NFS_MOUNT_LOCAL_FLOCK |
-					       NFS_MOUNT_LOCAL_FCNTL);
-				break;
-			case Opt_local_lock_flock:
-				mnt->flags |= NFS_MOUNT_LOCAL_FLOCK;
-				break;
-			case Opt_local_lock_posix:
-				mnt->flags |= NFS_MOUNT_LOCAL_FCNTL;
-				break;
-			case Opt_local_lock_none:
-				mnt->flags &= ~(NFS_MOUNT_LOCAL_FLOCK |
-						NFS_MOUNT_LOCAL_FCNTL);
-				break;
-			default:
-				dfprintk(MOUNT, "NFS:	invalid	"
-						"local_lock argument\n");
-				return 0;
-			};
-			break;
-
-		/*
-		 * Special options
-		 */
-		case Opt_sloppy:
-			sloppy = 1;
-			dfprintk(MOUNT, "NFS:   relaxing parsing rules\n");
-			break;
-		case Opt_userspace:
-		case Opt_deprecated:
-			dfprintk(MOUNT, "NFS:   ignoring mount option "
-					"'%s'\n", p);
-			break;
-
-		default:
-			invalid_option = 1;
-			dfprintk(MOUNT, "NFS:   unrecognized mount option "
-					"'%s'\n", p);
-		}
-	}
-
-	if (!sloppy && invalid_option)
-		return 0;
-
-	if (mnt->minorversion && mnt->version != 4)
-		goto out_minorversion_mismatch;
-
-	if (mnt->options & NFS_OPTION_MIGRATION &&
-	    (mnt->version != 4 || mnt->minorversion != 0))
-		goto out_migration_misuse;
-
-	/*
-	 * verify that any proto=/mountproto= options match the address
-	 * families in the addr=/mountaddr= options.
-	 */
-	if (protofamily != AF_UNSPEC &&
-	    protofamily != mnt->nfs_server.address.ss_family)
-		goto out_proto_mismatch;
-
-	if (mountfamily != AF_UNSPEC) {
-		if (mnt->mount_server.addrlen) {
-			if (mountfamily != mnt->mount_server.address.ss_family)
-				goto out_mountproto_mismatch;
-		} else {
-			if (mountfamily != mnt->nfs_server.address.ss_family)
-				goto out_mountproto_mismatch;
-		}
-	}
-
-	return 1;
-
-out_mountproto_mismatch:
-	printk(KERN_INFO "NFS: mount server address does not match mountproto= "
-			 "option\n");
-	return 0;
-out_proto_mismatch:
-	printk(KERN_INFO "NFS: server address does not match proto= option\n");
-	return 0;
-out_invalid_address:
-	printk(KERN_INFO "NFS: bad IP address specified: %s\n", p);
-	return 0;
-out_invalid_value:
-	printk(KERN_INFO "NFS: bad mount option value specified: %s\n", p);
-	return 0;
-out_minorversion_mismatch:
-	printk(KERN_INFO "NFS: mount option vers=%u does not support "
-			 "minorversion=%u\n", mnt->version, mnt->minorversion);
-	return 0;
-out_migration_misuse:
-	printk(KERN_INFO
-		"NFS: 'migration' not supported for this NFS version\n");
-	return 0;
-out_nomem:
-	printk(KERN_INFO "NFS: not enough memory to parse option\n");
-	return 0;
-out_security_failure:
-	free_secdata(secdata);
-	printk(KERN_INFO "NFS: security options invalid: %d\n", rc);
-	return 0;
-}
-
-/*
  * Ensure that a specified authtype in args->auth_info is supported by
  * the server. Returns 0 and sets args->selected_flavor if it's ok, and
  * -EACCES if not.
  */
-static int nfs_verify_authflavors(struct nfs_parsed_mount_data *args,
+static int nfs_verify_authflavors(struct nfs_mount_context *args,
 			rpc_authflavor_t *server_authlist, unsigned int count)
 {
 	rpc_authflavor_t flavor = RPC_AUTH_MAXFLAVOR;
@@ -1731,7 +726,7 @@ static int nfs_verify_authflavors(struct nfs_parsed_mount_data *args,
  * Use the remote server's MOUNT service to request the NFS file handle
  * corresponding to the provided path.
  */
-static int nfs_request_mount(struct nfs_parsed_mount_data *args,
+static int nfs_request_mount(struct nfs_mount_context *args,
 			     struct nfs_fh *root_fh,
 			     rpc_authflavor_t *server_authlist,
 			     unsigned int *server_authlist_len)
@@ -1745,7 +740,7 @@ static int nfs_request_mount(struct nfs_parsed_mount_data *args,
 		.noresvport	= args->flags & NFS_MOUNT_NORESVPORT,
 		.auth_flav_len	= server_authlist_len,
 		.auth_flavs	= server_authlist,
-		.net		= args->net,
+		.net		= args->mc.net_ns,
 	};
 	int status;
 
@@ -1768,7 +763,7 @@ static int nfs_request_mount(struct nfs_parsed_mount_data *args,
 	/*
 	 * Construct the mount server's address.
 	 */
-	if (args->mount_server.address.ss_family == AF_UNSPEC) {
+	if (args->mount_server.address.sa_family == AF_UNSPEC) {
 		memcpy(request.sap, &args->nfs_server.address,
 		       args->nfs_server.addrlen);
 		args->mount_server.addrlen = args->nfs_server.addrlen;
@@ -1790,20 +785,17 @@ static int nfs_request_mount(struct nfs_parsed_mount_data *args,
 	return 0;
 }
 
-static struct nfs_server *nfs_try_mount_request(struct nfs_mount_info *mount_info,
-					struct nfs_subversion *nfs_mod)
+static struct nfs_server *nfs_try_mount_request(struct nfs_mount_context *ctx)
 {
 	int status;
 	unsigned int i;
 	bool tried_auth_unix = false;
 	bool auth_null_in_list = false;
 	struct nfs_server *server = ERR_PTR(-EACCES);
-	struct nfs_parsed_mount_data *args = mount_info->parsed;
 	rpc_authflavor_t authlist[NFS_MAX_SECFLAVORS];
 	unsigned int authlist_len = ARRAY_SIZE(authlist);
 
-	status = nfs_request_mount(args, mount_info->mntfh, authlist,
-					&authlist_len);
+	status = nfs_request_mount(ctx, ctx->mntfh, authlist, &authlist_len);
 	if (status)
 		return ERR_PTR(status);
 
@@ -1811,13 +803,13 @@ static struct nfs_server *nfs_try_mount_request(struct nfs_mount_info *mount_inf
 	 * Was a sec= authflavor specified in the options? First, verify
 	 * whether the server supports it, and then just try to use it if so.
 	 */
-	if (args->auth_info.flavor_len > 0) {
-		status = nfs_verify_authflavors(args, authlist, authlist_len);
+	if (ctx->auth_info.flavor_len > 0) {
+		status = nfs_verify_authflavors(ctx, authlist, authlist_len);
 		dfprintk(MOUNT, "NFS: using auth flavor %u\n",
-			 args->selected_flavor);
+			 ctx->selected_flavor);
 		if (status)
 			return ERR_PTR(status);
-		return nfs_mod->rpc_ops->create_server(mount_info, nfs_mod);
+		return ctx->nfs_mod->rpc_ops->create_server(ctx);
 	}
 
 	/*
@@ -1843,8 +835,8 @@ static struct nfs_server *nfs_try_mount_request(struct nfs_mount_info *mount_inf
 			/* Fallthrough */
 		}
 		dfprintk(MOUNT, "NFS: attempting to use auth flavor %u\n", flavor);
-		args->selected_flavor = flavor;
-		server = nfs_mod->rpc_ops->create_server(mount_info, nfs_mod);
+		ctx->selected_flavor = flavor;
+		server = ctx->nfs_mod->rpc_ops->create_server(ctx);
 		if (!IS_ERR(server))
 			return server;
 	}
@@ -1859,338 +851,30 @@ static struct nfs_server *nfs_try_mount_request(struct nfs_mount_info *mount_inf
 
 	/* Last chance! Try AUTH_UNIX */
 	dfprintk(MOUNT, "NFS: attempting to use auth flavor %u\n", RPC_AUTH_UNIX);
-	args->selected_flavor = RPC_AUTH_UNIX;
-	return nfs_mod->rpc_ops->create_server(mount_info, nfs_mod);
+	ctx->selected_flavor = RPC_AUTH_UNIX;
+	return ctx->nfs_mod->rpc_ops->create_server(ctx);
 }
 
-struct dentry *nfs_try_mount(int flags, const char *dev_name,
-			     struct nfs_mount_info *mount_info,
-			     struct nfs_subversion *nfs_mod)
+struct dentry *nfs_try_mount(struct nfs_mount_context *ctx)
 {
 	struct nfs_server *server;
 
-	if (mount_info->parsed->need_mount)
-		server = nfs_try_mount_request(mount_info, nfs_mod);
+	dfprintk(MOUNT, "NFS: nfs_try_mount()\n");
+
+	if (ctx->need_mount)
+		server = nfs_try_mount_request(ctx);
 	else
-		server = nfs_mod->rpc_ops->create_server(mount_info, nfs_mod);
+		server = ctx->nfs_mod->rpc_ops->create_server(ctx);
 
-	if (IS_ERR(server))
+	if (IS_ERR(server)) {
+		ctx->mc.error = "NFS: Couldn't create server";
 		return ERR_CAST(server);
-
-	return nfs_fs_mount_common(server, flags, dev_name, mount_info, nfs_mod);
-}
-EXPORT_SYMBOL_GPL(nfs_try_mount);
-
-/*
- * Split "dev_name" into "hostname:export_path".
- *
- * The leftmost colon demarks the split between the server's hostname
- * and the export path.  If the hostname starts with a left square
- * bracket, then it may contain colons.
- *
- * Note: caller frees hostname and export path, even on error.
- */
-static int nfs_parse_devname(const char *dev_name,
-			     char **hostname, size_t maxnamlen,
-			     char **export_path, size_t maxpathlen)
-{
-	size_t len;
-	char *end;
-
-	/* Is the host name protected with square brakcets? */
-	if (*dev_name == '[') {
-		end = strchr(++dev_name, ']');
-		if (end == NULL || end[1] != ':')
-			goto out_bad_devname;
-
-		len = end - dev_name;
-		end++;
-	} else {
-		char *comma;
-
-		end = strchr(dev_name, ':');
-		if (end == NULL)
-			goto out_bad_devname;
-		len = end - dev_name;
-
-		/* kill possible hostname list: not supported */
-		comma = strchr(dev_name, ',');
-		if (comma != NULL && comma < end)
-			*comma = 0;
 	}
 
-	if (len > maxnamlen)
-		goto out_hostname;
-
-	/* N.B. caller will free nfs_server.hostname in all cases */
-	*hostname = kstrndup(dev_name, len, GFP_KERNEL);
-	if (*hostname == NULL)
-		goto out_nomem;
-	len = strlen(++end);
-	if (len > maxpathlen)
-		goto out_path;
-	*export_path = kstrndup(end, len, GFP_KERNEL);
-	if (!*export_path)
-		goto out_nomem;
-
-	dfprintk(MOUNT, "NFS: MNTPATH: '%s'\n", *export_path);
-	return 0;
-
-out_bad_devname:
-	dfprintk(MOUNT, "NFS: device name not in host:path format\n");
-	return -EINVAL;
-
-out_nomem:
-	dfprintk(MOUNT, "NFS: not enough memory to parse device name\n");
-	return -ENOMEM;
-
-out_hostname:
-	dfprintk(MOUNT, "NFS: server hostname too long\n");
-	return -ENAMETOOLONG;
-
-out_path:
-	dfprintk(MOUNT, "NFS: export pathname too long\n");
-	return -ENAMETOOLONG;
+	return nfs_fs_mount_common(server, ctx);
 }
+EXPORT_SYMBOL_GPL(nfs_try_mount);
 
-/*
- * Validate the NFS2/NFS3 mount data
- * - fills in the mount root filehandle
- *
- * For option strings, user space handles the following behaviors:
- *
- * + DNS: mapping server host name to IP address ("addr=" option)
- *
- * + failure mode: how to behave if a mount request can't be handled
- *   immediately ("fg/bg" option)
- *
- * + retry: how often to retry a mount request ("retry=" option)
- *
- * + breaking back: trying proto=udp after proto=tcp, v2 after v3,
- *   mountproto=tcp after mountproto=udp, and so on
- */
-static int nfs23_validate_mount_data(void *options,
-				     struct nfs_parsed_mount_data *args,
-				     struct nfs_fh *mntfh,
-				     const char *dev_name)
-{
-	struct nfs_mount_data *data = (struct nfs_mount_data *)options;
-	struct sockaddr *sap = (struct sockaddr *)&args->nfs_server.address;
-	int extra_flags = NFS_MOUNT_LEGACY_INTERFACE;
-
-	if (data == NULL)
-		goto out_no_data;
-
-	args->version = NFS_DEFAULT_VERSION;
-	switch (data->version) {
-	case 1:
-		data->namlen = 0;
-	case 2:
-		data->bsize = 0;
-	case 3:
-		if (data->flags & NFS_MOUNT_VER3)
-			goto out_no_v3;
-		data->root.size = NFS2_FHSIZE;
-		memcpy(data->root.data, data->old_root.data, NFS2_FHSIZE);
-		/* Turn off security negotiation */
-		extra_flags |= NFS_MOUNT_SECFLAVOUR;
-	case 4:
-		if (data->flags & NFS_MOUNT_SECFLAVOUR)
-			goto out_no_sec;
-	case 5:
-		memset(data->context, 0, sizeof(data->context));
-	case 6:
-		if (data->flags & NFS_MOUNT_VER3) {
-			if (data->root.size > NFS3_FHSIZE || data->root.size == 0)
-				goto out_invalid_fh;
-			mntfh->size = data->root.size;
-			args->version = 3;
-		} else {
-			mntfh->size = NFS2_FHSIZE;
-			args->version = 2;
-		}
-
-
-		memcpy(mntfh->data, data->root.data, mntfh->size);
-		if (mntfh->size < sizeof(mntfh->data))
-			memset(mntfh->data + mntfh->size, 0,
-			       sizeof(mntfh->data) - mntfh->size);
-
-		/*
-		 * Translate to nfs_parsed_mount_data, which nfs_fill_super
-		 * can deal with.
-		 */
-		args->flags		= data->flags & NFS_MOUNT_FLAGMASK;
-		args->flags		|= extra_flags;
-		args->rsize		= data->rsize;
-		args->wsize		= data->wsize;
-		args->timeo		= data->timeo;
-		args->retrans		= data->retrans;
-		args->acregmin		= data->acregmin;
-		args->acregmax		= data->acregmax;
-		args->acdirmin		= data->acdirmin;
-		args->acdirmax		= data->acdirmax;
-		args->need_mount	= false;
-
-		memcpy(sap, &data->addr, sizeof(data->addr));
-		args->nfs_server.addrlen = sizeof(data->addr);
-		args->nfs_server.port = ntohs(data->addr.sin_port);
-		if (!nfs_verify_server_address(sap))
-			goto out_no_address;
-
-		if (!(data->flags & NFS_MOUNT_TCP))
-			args->nfs_server.protocol = XPRT_TRANSPORT_UDP;
-		/* N.B. caller will free nfs_server.hostname in all cases */
-		args->nfs_server.hostname = kstrdup(data->hostname, GFP_KERNEL);
-		args->namlen		= data->namlen;
-		args->bsize		= data->bsize;
-
-		if (data->flags & NFS_MOUNT_SECFLAVOUR)
-			args->selected_flavor = data->pseudoflavor;
-		else
-			args->selected_flavor = RPC_AUTH_UNIX;
-		if (!args->nfs_server.hostname)
-			goto out_nomem;
-
-		if (!(data->flags & NFS_MOUNT_NONLM))
-			args->flags &= ~(NFS_MOUNT_LOCAL_FLOCK|
-					 NFS_MOUNT_LOCAL_FCNTL);
-		else
-			args->flags |= (NFS_MOUNT_LOCAL_FLOCK|
-					NFS_MOUNT_LOCAL_FCNTL);
-		/*
-		 * The legacy version 6 binary mount data from userspace has a
-		 * field used only to transport selinux information into the
-		 * the kernel.  To continue to support that functionality we
-		 * have a touch of selinux knowledge here in the NFS code. The
-		 * userspace code converted context=blah to just blah so we are
-		 * converting back to the full string selinux understands.
-		 */
-		if (data->context[0]){
-#ifdef CONFIG_SECURITY_SELINUX
-			int rc;
-			char *opts_str = kmalloc(sizeof(data->context) + 8, GFP_KERNEL);
-			if (!opts_str)
-				return -ENOMEM;
-			strcpy(opts_str, "context=");
-			data->context[NFS_MAX_CONTEXT_LEN] = '\0';
-			strcat(opts_str, &data->context[0]);
-			rc = security_sb_parse_opts_str(opts_str, &args->lsm_opts);
-			kfree(opts_str);
-			if (rc)
-				return rc;
-#else
-			return -EINVAL;
-#endif
-		}
-
-		break;
-	default:
-		return NFS_TEXT_DATA;
-	}
-
-	return 0;
-
-out_no_data:
-	dfprintk(MOUNT, "NFS: mount program didn't pass any mount data\n");
-	return -EINVAL;
-
-out_no_v3:
-	dfprintk(MOUNT, "NFS: nfs_mount_data version %d does not support v3\n",
-		 data->version);
-	return -EINVAL;
-
-out_no_sec:
-	dfprintk(MOUNT, "NFS: nfs_mount_data version supports only AUTH_SYS\n");
-	return -EINVAL;
-
-out_nomem:
-	dfprintk(MOUNT, "NFS: not enough memory to handle mount options\n");
-	return -ENOMEM;
-
-out_no_address:
-	dfprintk(MOUNT, "NFS: mount program didn't pass remote address\n");
-	return -EINVAL;
-
-out_invalid_fh:
-	dfprintk(MOUNT, "NFS: invalid root filehandle\n");
-	return -EINVAL;
-}
-
-#if IS_ENABLED(CONFIG_NFS_V4)
-static int nfs_validate_mount_data(struct file_system_type *fs_type,
-				   void *options,
-				   struct nfs_parsed_mount_data *args,
-				   struct nfs_fh *mntfh,
-				   const char *dev_name)
-{
-	if (fs_type == &nfs_fs_type)
-		return nfs23_validate_mount_data(options, args, mntfh, dev_name);
-	return nfs4_validate_mount_data(options, args, dev_name);
-}
-#else
-static int nfs_validate_mount_data(struct file_system_type *fs_type,
-				   void *options,
-				   struct nfs_parsed_mount_data *args,
-				   struct nfs_fh *mntfh,
-				   const char *dev_name)
-{
-	return nfs23_validate_mount_data(options, args, mntfh, dev_name);
-}
-#endif
-
-static int nfs_validate_text_mount_data(void *options,
-					struct nfs_parsed_mount_data *args,
-					const char *dev_name)
-{
-	int port = 0;
-	int max_namelen = PAGE_SIZE;
-	int max_pathlen = NFS_MAXPATHLEN;
-	struct sockaddr *sap = (struct sockaddr *)&args->nfs_server.address;
-
-	if (nfs_parse_mount_options((char *)options, args) == 0)
-		return -EINVAL;
-
-	if (!nfs_verify_server_address(sap))
-		goto out_no_address;
-
-	if (args->version == 4) {
-#if IS_ENABLED(CONFIG_NFS_V4)
-		port = NFS_PORT;
-		max_namelen = NFS4_MAXNAMLEN;
-		max_pathlen = NFS4_MAXPATHLEN;
-		nfs_validate_transport_protocol(args);
-		if (args->nfs_server.protocol == XPRT_TRANSPORT_UDP)
-			goto out_invalid_transport_udp;
-		nfs4_validate_mount_flags(args);
-#else
-		goto out_v4_not_compiled;
-#endif /* CONFIG_NFS_V4 */
-	} else
-		nfs_set_mount_transport_protocol(args);
-
-	nfs_set_port(sap, &args->nfs_server.port, port);
-
-	return nfs_parse_devname(dev_name,
-				   &args->nfs_server.hostname,
-				   max_namelen,
-				   &args->nfs_server.export_path,
-				   max_pathlen);
-
-#if !IS_ENABLED(CONFIG_NFS_V4)
-out_v4_not_compiled:
-	dfprintk(MOUNT, "NFS: NFSv4 is not compiled into kernel\n");
-	return -EPROTONOSUPPORT;
-#else
-out_invalid_transport_udp:
-	dfprintk(MOUNT, "NFSv4: Unsupported transport protocol udp\n");
-	return -EINVAL;
-#endif /* !CONFIG_NFS_V4 */
-
-out_no_address:
-	dfprintk(MOUNT, "NFS: mount program didn't pass remote address\n");
-	return -EINVAL;
-}
 
 #define NFS_REMOUNT_CMP_FLAGMASK ~(NFS_MOUNT_INTR \
 		| NFS_MOUNT_SECURE \
@@ -2207,7 +891,7 @@ static int nfs_validate_text_mount_data(void *options,
 
 static int
 nfs_compare_remount_data(struct nfs_server *nfss,
-			 struct nfs_parsed_mount_data *data)
+			 struct nfs_mount_context *data)
 {
 	if ((data->flags ^ nfss->flags) & NFS_REMOUNT_CMP_FLAGMASK ||
 	    data->rsize != nfss->rsize ||
@@ -2230,15 +914,11 @@ nfs_compare_remount_data(struct nfs_server *nfss,
 	return 0;
 }
 
-int
-nfs_remount(struct super_block *sb, int *flags, char *raw_data)
+int nfs_remount(struct super_block *sb, struct mount_context *mc)
 {
-	int error;
+	struct nfs_mount_context *ctx =
+		container_of(mc, struct nfs_mount_context, mc);
 	struct nfs_server *nfss = sb->s_fs_info;
-	struct nfs_parsed_mount_data *data;
-	struct nfs_mount_data *options = (struct nfs_mount_data *)raw_data;
-	struct nfs4_mount_data *options4 = (struct nfs4_mount_data *)raw_data;
-	u32 nfsvers = nfss->nfs_client->rpc_ops->version;
 
 	sync_filesystem(sb);
 
@@ -2248,60 +928,27 @@ nfs_remount(struct super_block *sb, int *flags, char *raw_data)
 	 * ones were explicitly specified. Fall back to legacy behavior and
 	 * just return success.
 	 */
-	if ((nfsvers == 4 && (!options4 || options4->version == 1)) ||
-	    (nfsvers <= 3 && (!options || (options->version >= 1 &&
-					   options->version <= 6))))
+	if (ctx->skip_remount_option_check)
 		return 0;
 
-	data = kzalloc(sizeof(*data), GFP_KERNEL);
-	if (data == NULL)
-		return -ENOMEM;
-
-	/* fill out struct with values from existing mount */
-	data->flags = nfss->flags;
-	data->rsize = nfss->rsize;
-	data->wsize = nfss->wsize;
-	data->retrans = nfss->client->cl_timeout->to_retries;
-	data->selected_flavor = nfss->client->cl_auth->au_flavor;
-	data->acregmin = nfss->acregmin / HZ;
-	data->acregmax = nfss->acregmax / HZ;
-	data->acdirmin = nfss->acdirmin / HZ;
-	data->acdirmax = nfss->acdirmax / HZ;
-	data->timeo = 10U * nfss->client->cl_timeout->to_initval / HZ;
-	data->nfs_server.port = nfss->port;
-	data->nfs_server.addrlen = nfss->nfs_client->cl_addrlen;
-	data->version = nfsvers;
-	data->minorversion = nfss->nfs_client->cl_minorversion;
-	data->net = current->nsproxy->net_ns;
-	memcpy(&data->nfs_server.address, &nfss->nfs_client->cl_addr,
-		data->nfs_server.addrlen);
-
-	/* overwrite those values with any that were specified */
-	error = -EINVAL;
-	if (!nfs_parse_mount_options((char *)options, data))
-		goto out;
-
 	/*
 	 * noac is a special case. It implies -o sync, but that's not
 	 * necessarily reflected in the mtab options. do_remount_sb
 	 * will clear MS_SYNCHRONOUS if -o sync wasn't specified in the
 	 * remount options, so we have to explicitly reset it.
 	 */
-	if (data->flags & NFS_MOUNT_NOAC)
-		*flags |= MS_SYNCHRONOUS;
+	if (ctx->flags & NFS_MOUNT_NOAC)
+		ctx->mc.ms_flags |= MS_SYNCHRONOUS;
 
 	/* compare new mount options with old ones */
-	error = nfs_compare_remount_data(nfss, data);
-out:
-	kfree(data);
-	return error;
+	return nfs_compare_remount_data(nfss, ctx);
 }
 EXPORT_SYMBOL_GPL(nfs_remount);
 
 /*
  * Initialise the common bits of the superblock
  */
-inline void nfs_initialise_sb(struct super_block *sb)
+static inline void nfs_initialise_sb(struct super_block *sb)
 {
 	struct nfs_server *server = NFS_SB(sb);
 
@@ -2319,69 +966,75 @@ inline void nfs_initialise_sb(struct super_block *sb)
 }
 
 /*
- * Finish setting up an NFS2/3 superblock
+ * Finish setting up a cloned NFS2/3/4 superblock
  */
-int nfs_fill_super(struct super_block *sb, struct nfs_mount_info *mount_info)
+static int nfs_clone_super(struct super_block *sb, struct mount_context *mc)
 {
-	struct nfs_parsed_mount_data *data = mount_info->parsed;
+	struct nfs_mount_context *ctx =
+		container_of(mc, struct nfs_mount_context, mc);
+	const struct super_block *old_sb = ctx->clone_data.sb;
 	struct nfs_server *server = NFS_SB(sb);
-	int ret;
 
-	sb->s_blocksize_bits = 0;
-	sb->s_blocksize = 0;
-	sb->s_xattr = server->nfs_client->cl_nfs_mod->xattr;
-	sb->s_op = server->nfs_client->cl_nfs_mod->sops;
-	if (data && data->bsize)
-		sb->s_blocksize = nfs_block_size(data->bsize, &sb->s_blocksize_bits);
+	sb->s_blocksize_bits = old_sb->s_blocksize_bits;
+	sb->s_blocksize = old_sb->s_blocksize;
+	sb->s_maxbytes = old_sb->s_maxbytes;
+	sb->s_xattr = old_sb->s_xattr;
+	sb->s_op = old_sb->s_op;
+	sb->s_time_gran = 1;
 
 	if (server->nfs_client->rpc_ops->version != 2) {
 		/* The VFS shouldn't apply the umask to mode bits. We will do
 		 * so ourselves when necessary.
 		 */
 		sb->s_flags |= MS_POSIXACL;
-		sb->s_time_gran = 1;
 	}
 
  	nfs_initialise_sb(sb);
 
-	ret = super_setup_bdi_name(sb, "%u:%u", MAJOR(server->s_dev),
-				   MINOR(server->s_dev));
-	if (ret)
-		return ret;
-	sb->s_bdi->ra_pages = server->rpages * NFS_MAX_READAHEAD;
-	return 0;
+	sb->s_bdi = bdi_get(old_sb->s_bdi);
 
+	return 0;
 }
-EXPORT_SYMBOL_GPL(nfs_fill_super);
 
 /*
- * Finish setting up a cloned NFS2/3/4 superblock
+ * Finish setting up an NFS2/3 superblock
  */
-int nfs_clone_super(struct super_block *sb, struct nfs_mount_info *mount_info)
+int nfs_fill_super(struct super_block *sb, struct mount_context *mc)
 {
-	const struct super_block *old_sb = mount_info->cloned->sb;
+	struct nfs_mount_context *ctx =
+		container_of(mc, struct nfs_mount_context, mc);
 	struct nfs_server *server = NFS_SB(sb);
+	int ret;
 
-	sb->s_blocksize_bits = old_sb->s_blocksize_bits;
-	sb->s_blocksize = old_sb->s_blocksize;
-	sb->s_maxbytes = old_sb->s_maxbytes;
-	sb->s_xattr = old_sb->s_xattr;
-	sb->s_op = old_sb->s_op;
-	sb->s_time_gran = 1;
+	if (ctx->clone_data.sb)
+		return nfs_clone_super(sb, mc);
+
+	sb->s_blocksize_bits = 0;
+	sb->s_blocksize = 0;
+	sb->s_xattr = server->nfs_client->cl_nfs_mod->xattr;
+	sb->s_op = server->nfs_client->cl_nfs_mod->sops;
+	if (ctx->bsize)
+		sb->s_blocksize = nfs_block_size(ctx->bsize, &sb->s_blocksize_bits);
 
 	if (server->nfs_client->rpc_ops->version != 2) {
 		/* The VFS shouldn't apply the umask to mode bits. We will do
 		 * so ourselves when necessary.
 		 */
 		sb->s_flags |= MS_POSIXACL;
+		sb->s_time_gran = 1;
 	}
 
  	nfs_initialise_sb(sb);
 
-	sb->s_bdi = bdi_get(old_sb->s_bdi);
-
+	ret = super_setup_bdi_name(sb, "%u:%u", MAJOR(server->s_dev),
+				   MINOR(server->s_dev));
+	if (ret)
+		return ret;
+	sb->s_bdi->ra_pages = server->rpages * NFS_MAX_READAHEAD;
 	return 0;
+
 }
+EXPORT_SYMBOL_GPL(nfs_fill_super);
 
 static int nfs_compare_mount_options(const struct super_block *s, const struct nfs_server *b, int flags)
 {
@@ -2495,8 +1148,7 @@ static int nfs_compare_super(struct super_block *sb, void *data)
 
 #ifdef CONFIG_NFS_FSCACHE
 static void nfs_get_cache_cookie(struct super_block *sb,
-				 struct nfs_parsed_mount_data *parsed,
-				 struct nfs_clone_mount *cloned)
+				 struct nfs_mount_context *ctx)
 {
 	struct nfs_server *nfss = NFS_SB(sb);
 	char *uniq = NULL;
@@ -2505,21 +1157,21 @@ static void nfs_get_cache_cookie(struct super_block *sb,
 	nfss->fscache_key = NULL;
 	nfss->fscache = NULL;
 
-	if (parsed) {
-		if (!(parsed->options & NFS_OPTION_FSCACHE))
+	if (!ctx->clone_data.cloned) {
+		if (!(ctx->options & NFS_OPTION_FSCACHE))
 			return;
-		if (parsed->fscache_uniq) {
-			uniq = parsed->fscache_uniq;
-			ulen = strlen(parsed->fscache_uniq);
+		if (ctx->fscache_uniq) {
+			uniq = ctx->fscache_uniq;
+			ulen = strlen(ctx->fscache_uniq);
 		}
-	} else if (cloned) {
-		struct nfs_server *mnt_s = NFS_SB(cloned->sb);
+	} else if (ctx->clone_data.cloned) {
+		struct nfs_server *mnt_s = NFS_SB(ctx->clone_data.sb);
 		if (!(mnt_s->options & NFS_OPTION_FSCACHE))
 			return;
 		if (mnt_s->fscache_key) {
 			uniq = mnt_s->fscache_key->key.uniquifier;
 			ulen = mnt_s->fscache_key->key.uniq_len;
-		};
+		}
 	} else
 		return;
 
@@ -2527,22 +1179,22 @@ static void nfs_get_cache_cookie(struct super_block *sb,
 }
 #else
 static void nfs_get_cache_cookie(struct super_block *sb,
-				 struct nfs_parsed_mount_data *parsed,
-				 struct nfs_clone_mount *cloned)
+				 struct nfs_mount_context *ctx)
 {
 }
 #endif
 
 int nfs_set_sb_security(struct super_block *s, struct dentry *mntroot,
-			struct nfs_mount_info *mount_info)
+			struct nfs_mount_context *ctx)
 {
 	int error;
 	unsigned long kflags = 0, kflags_out = 0;
+
 	if (NFS_SB(s)->caps & NFS_CAP_SECURITY_LABEL)
 		kflags |= SECURITY_LSM_NATIVE_LABELS;
 
-	error = security_sb_set_mnt_opts(s, &mount_info->parsed->lsm_opts,
-						kflags, &kflags_out);
+	error = security_sb_set_mnt_opts(s, ctx->mc.security,
+					 kflags, &kflags_out);
 	if (error)
 		goto err;
 
@@ -2555,25 +1207,23 @@ int nfs_set_sb_security(struct super_block *s, struct dentry *mntroot,
 EXPORT_SYMBOL_GPL(nfs_set_sb_security);
 
 int nfs_clone_sb_security(struct super_block *s, struct dentry *mntroot,
-			  struct nfs_mount_info *mount_info)
+			  struct nfs_mount_context *ctx)
 {
 	/* clone any lsm security options from the parent to the new sb */
 	if (d_inode(mntroot)->i_op != NFS_SB(s)->nfs_client->rpc_ops->dir_inode_ops)
 		return -ESTALE;
-	return security_sb_clone_mnt_opts(mount_info->cloned->sb, s);
+	return security_sb_clone_mnt_opts(ctx->clone_data.sb, s);
 }
 EXPORT_SYMBOL_GPL(nfs_clone_sb_security);
 
 struct dentry *nfs_fs_mount_common(struct nfs_server *server,
-				   int flags, const char *dev_name,
-				   struct nfs_mount_info *mount_info,
-				   struct nfs_subversion *nfs_mod)
+				   struct nfs_mount_context *ctx)
 {
 	struct super_block *s;
 	struct dentry *mntroot = ERR_PTR(-ENOMEM);
 	int (*compare_super)(struct super_block *, void *) = nfs_compare_super;
 	struct nfs_sb_mountdata sb_mntdata = {
-		.mntflags = flags,
+		.mntflags = ctx->mc.ms_flags,
 		.server = server,
 	};
 	int error;
@@ -2585,14 +1235,16 @@ struct dentry *nfs_fs_mount_common(struct nfs_server *server,
 	if (server->flags & NFS_MOUNT_NOAC)
 		sb_mntdata.mntflags |= MS_SYNCHRONOUS;
 
-	if (mount_info->cloned != NULL && mount_info->cloned->sb != NULL)
-		if (mount_info->cloned->sb->s_flags & MS_SYNCHRONOUS)
+	if (ctx->clone_data.cloned && ctx->clone_data.sb != NULL)
+		if (ctx->clone_data.sb->s_flags & MS_SYNCHRONOUS)
 			sb_mntdata.mntflags |= MS_SYNCHRONOUS;
 
 	/* Get a superblock - note that we may end up sharing one that already exists */
-	s = sget(nfs_mod->nfs_fs, compare_super, nfs_set_super, flags, &sb_mntdata);
+	s = sget(ctx->nfs_mod->nfs_fs, compare_super, nfs_set_super, ctx->mc.ms_flags,
+		 &sb_mntdata);
 	if (IS_ERR(s)) {
 		mntroot = ERR_CAST(s);
+		ctx->mc.error = "NFS: Couldn't get superblock";
 		goto out_err_nosb;
 	}
 
@@ -2605,17 +1257,17 @@ struct dentry *nfs_fs_mount_common(struct nfs_server *server,
 
 	if (!s->s_root) {
 		/* initial superblock/root creation */
-		error = mount_info->fill_super(s, mount_info);
-		if (error)
-			goto error_splat_super;
-		nfs_get_cache_cookie(s, mount_info->parsed, mount_info->cloned);
+		ctx->mc.ops->fill_super(s, &ctx->mc);
+		nfs_get_cache_cookie(s, ctx);
 	}
 
-	mntroot = nfs_get_root(s, mount_info->mntfh, dev_name);
-	if (IS_ERR(mntroot))
+	mntroot = nfs_get_root(s, ctx->mntfh, ctx->mc.device);
+	if (IS_ERR(mntroot)) {
+		ctx->mc.error = "NFS: Couldn't get root dentry";
 		goto error_splat_super;
+	}
 
-	error = mount_info->set_security(s, mntroot, mount_info);
+	error = ctx->set_security(s, mntroot, ctx);
 	if (error)
 		goto error_splat_root;
 
@@ -2637,47 +1289,6 @@ struct dentry *nfs_fs_mount_common(struct nfs_server *server,
 }
 EXPORT_SYMBOL_GPL(nfs_fs_mount_common);
 
-struct dentry *nfs_fs_mount(struct file_system_type *fs_type,
-	int flags, const char *dev_name, void *raw_data)
-{
-	struct nfs_mount_info mount_info = {
-		.fill_super = nfs_fill_super,
-		.set_security = nfs_set_sb_security,
-	};
-	struct dentry *mntroot = ERR_PTR(-ENOMEM);
-	struct nfs_subversion *nfs_mod;
-	int error;
-
-	mount_info.parsed = nfs_alloc_parsed_mount_data();
-	mount_info.mntfh = nfs_alloc_fhandle();
-	if (mount_info.parsed == NULL || mount_info.mntfh == NULL)
-		goto out;
-
-	/* Validate the mount data */
-	error = nfs_validate_mount_data(fs_type, raw_data, mount_info.parsed, mount_info.mntfh, dev_name);
-	if (error == NFS_TEXT_DATA)
-		error = nfs_validate_text_mount_data(raw_data, mount_info.parsed, dev_name);
-	if (error < 0) {
-		mntroot = ERR_PTR(error);
-		goto out;
-	}
-
-	nfs_mod = get_nfs_version(mount_info.parsed->version);
-	if (IS_ERR(nfs_mod)) {
-		mntroot = ERR_CAST(nfs_mod);
-		goto out;
-	}
-
-	mntroot = nfs_mod->rpc_ops->try_mount(flags, dev_name, &mount_info, nfs_mod);
-
-	put_nfs_version(nfs_mod);
-out:
-	nfs_free_parsed_mount_data(mount_info.parsed);
-	nfs_free_fhandle(mount_info.mntfh);
-	return mntroot;
-}
-EXPORT_SYMBOL_GPL(nfs_fs_mount);
-
 /*
  * Destroy an NFS2/3 superblock
  */
@@ -2695,150 +1306,8 @@ void nfs_kill_super(struct super_block *s)
 }
 EXPORT_SYMBOL_GPL(nfs_kill_super);
 
-/*
- * Clone an NFS2/3/4 server record on xdev traversal (FSID-change)
- */
-static struct dentry *
-nfs_xdev_mount(struct file_system_type *fs_type, int flags,
-		const char *dev_name, void *raw_data)
-{
-	struct nfs_clone_mount *data = raw_data;
-	struct nfs_mount_info mount_info = {
-		.fill_super = nfs_clone_super,
-		.set_security = nfs_clone_sb_security,
-		.cloned = data,
-	};
-	struct nfs_server *server;
-	struct dentry *mntroot = ERR_PTR(-ENOMEM);
-	struct nfs_subversion *nfs_mod = NFS_SB(data->sb)->nfs_client->cl_nfs_mod;
-
-	dprintk("--> nfs_xdev_mount()\n");
-
-	mount_info.mntfh = mount_info.cloned->fh;
-
-	/* create a new volume representation */
-	server = nfs_mod->rpc_ops->clone_server(NFS_SB(data->sb), data->fh, data->fattr, data->authflavor);
-
-	if (IS_ERR(server))
-		mntroot = ERR_CAST(server);
-	else
-		mntroot = nfs_fs_mount_common(server, flags,
-				dev_name, &mount_info, nfs_mod);
-
-	dprintk("<-- nfs_xdev_mount() = %ld\n",
-			IS_ERR(mntroot) ? PTR_ERR(mntroot) : 0L);
-	return mntroot;
-}
-
 #if IS_ENABLED(CONFIG_NFS_V4)
 
-static void nfs4_validate_mount_flags(struct nfs_parsed_mount_data *args)
-{
-	args->flags &= ~(NFS_MOUNT_NONLM|NFS_MOUNT_NOACL|NFS_MOUNT_VER3|
-			 NFS_MOUNT_LOCAL_FLOCK|NFS_MOUNT_LOCAL_FCNTL);
-}
-
-/*
- * Validate NFSv4 mount options
- */
-static int nfs4_validate_mount_data(void *options,
-				    struct nfs_parsed_mount_data *args,
-				    const char *dev_name)
-{
-	struct sockaddr *sap = (struct sockaddr *)&args->nfs_server.address;
-	struct nfs4_mount_data *data = (struct nfs4_mount_data *)options;
-	char *c;
-
-	if (data == NULL)
-		goto out_no_data;
-
-	args->version = 4;
-
-	switch (data->version) {
-	case 1:
-		if (data->host_addrlen > sizeof(args->nfs_server.address))
-			goto out_no_address;
-		if (data->host_addrlen == 0)
-			goto out_no_address;
-		args->nfs_server.addrlen = data->host_addrlen;
-		if (copy_from_user(sap, data->host_addr, data->host_addrlen))
-			return -EFAULT;
-		if (!nfs_verify_server_address(sap))
-			goto out_no_address;
-		args->nfs_server.port = ntohs(((struct sockaddr_in *)sap)->sin_port);
-
-		if (data->auth_flavourlen) {
-			rpc_authflavor_t pseudoflavor;
-			if (data->auth_flavourlen > 1)
-				goto out_inval_auth;
-			if (copy_from_user(&pseudoflavor,
-					   data->auth_flavours,
-					   sizeof(pseudoflavor)))
-				return -EFAULT;
-			args->selected_flavor = pseudoflavor;
-		} else
-			args->selected_flavor = RPC_AUTH_UNIX;
-
-		c = strndup_user(data->hostname.data, NFS4_MAXNAMLEN);
-		if (IS_ERR(c))
-			return PTR_ERR(c);
-		args->nfs_server.hostname = c;
-
-		c = strndup_user(data->mnt_path.data, NFS4_MAXPATHLEN);
-		if (IS_ERR(c))
-			return PTR_ERR(c);
-		args->nfs_server.export_path = c;
-		dfprintk(MOUNT, "NFS: MNTPATH: '%s'\n", c);
-
-		c = strndup_user(data->client_addr.data, 16);
-		if (IS_ERR(c))
-			return PTR_ERR(c);
-		args->client_address = c;
-
-		/*
-		 * Translate to nfs_parsed_mount_data, which nfs4_fill_super
-		 * can deal with.
-		 */
-
-		args->flags	= data->flags & NFS4_MOUNT_FLAGMASK;
-		args->rsize	= data->rsize;
-		args->wsize	= data->wsize;
-		args->timeo	= data->timeo;
-		args->retrans	= data->retrans;
-		args->acregmin	= data->acregmin;
-		args->acregmax	= data->acregmax;
-		args->acdirmin	= data->acdirmin;
-		args->acdirmax	= data->acdirmax;
-		args->nfs_server.protocol = data->proto;
-		nfs_validate_transport_protocol(args);
-		if (args->nfs_server.protocol == XPRT_TRANSPORT_UDP)
-			goto out_invalid_transport_udp;
-
-		break;
-	default:
-		return NFS_TEXT_DATA;
-	}
-
-	return 0;
-
-out_no_data:
-	dfprintk(MOUNT, "NFS4: mount program didn't pass any mount data\n");
-	return -EINVAL;
-
-out_inval_auth:
-	dfprintk(MOUNT, "NFS4: Invalid number of RPC auth flavours %d\n",
-		 data->auth_flavourlen);
-	return -EINVAL;
-
-out_no_address:
-	dfprintk(MOUNT, "NFS4: mount program didn't pass remote address\n");
-	return -EINVAL;
-
-out_invalid_transport_udp:
-	dfprintk(MOUNT, "NFSv4: Unsupported transport protocol udp\n");
-	return -EINVAL;
-}
-
 /*
  * NFS v4 module parameters need to stay in the
  * NFS client for backwards compatibility
diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
index 348f7c158084..ca0e3793cb50 100644
--- a/include/linux/nfs_xdr.h
+++ b/include/linux/nfs_xdr.h
@@ -1540,6 +1540,7 @@ struct nfs_subversion;
 struct nfs_mount_info;
 struct nfs_client_initdata;
 struct nfs_pageio_descriptor;
+struct nfs_mount_context;
 
 /*
  * RPC procedure vector for NFSv2/NFSv3 demuxing
@@ -1553,10 +1554,10 @@ struct nfs_rpc_ops {
 
 	int	(*getroot) (struct nfs_server *, struct nfs_fh *,
 			    struct nfs_fsinfo *);
+	struct dentry *(*mount)(struct nfs_mount_context *);
 	struct vfsmount *(*submount) (struct nfs_server *, struct dentry *,
 				      struct nfs_fh *, struct nfs_fattr *);
-	struct dentry *(*try_mount) (int, const char *, struct nfs_mount_info *,
-				     struct nfs_subversion *);
+	struct dentry *(*try_mount) (struct nfs_mount_context *);
 	int	(*getattr) (struct nfs_server *, struct nfs_fh *,
 			    struct nfs_fattr *, struct nfs4_label *);
 	int	(*setattr) (struct dentry *, struct nfs_fattr *,
@@ -1617,7 +1618,7 @@ struct nfs_rpc_ops {
 	struct nfs_client *(*init_client) (struct nfs_client *,
 				const struct nfs_client_initdata *);
 	void	(*free_client) (struct nfs_client *);
-	struct nfs_server *(*create_server)(struct nfs_mount_info *, struct nfs_subversion *);
+	struct nfs_server *(*create_server)(struct nfs_mount_context *);
 	struct nfs_server *(*clone_server)(struct nfs_server *, struct nfs_fh *,
 					   struct nfs_fattr *, rpc_authflavor_t);
 };


* Re: [RFC][PATCH 0/9] VFS: Introduce mount context
  2017-05-03 16:04 [RFC][PATCH 0/9] VFS: Introduce mount context David Howells
                   ` (8 preceding siblings ...)
  2017-05-03 16:05 ` [PATCH 9/9] NFS: Support the mount context and fsopen() David Howells
@ 2017-05-03 16:44 ` Jeff Layton
  2017-05-03 16:50 ` David Howells
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 66+ messages in thread
From: Jeff Layton @ 2017-05-03 16:44 UTC (permalink / raw)
  To: David Howells, viro; +Cc: linux-fsdevel, linux-nfs, linux-kernel, mszeredi

On Wed, 2017-05-03 at 17:04 +0100, David Howells wrote:
> Here are a set of patches to create a mount context prior to setting up a
> new mount, populating it with the parsed options/binary data and then
> effecting the mount.
> 
> This allows namespaces and other information to be conveyed through the
> mount procedure.  It also allows extra error information to be returned
> (so many things can go wrong during a mount that a small integer isn't
> really sufficient to convey the issue).
> 
> This also allows Miklós Szeredi's idea of doing:
> 
> 	fd = fsopen("nfs");
> 	write(fd, "option=val", ...);
> 	fsmount(fd, "/mnt");
> 
> that he presented at LSF-2017 to be implemented (see the relevant patches
> in the series), to which I can add:
> 
> 	read(fd, error_buffer, ...);
> 
> to read back any error message.  I didn't use netlink as that would make it
> depend on CONFIG_NET and would introduce network namespacing issues.
> 

Nice work!

> I've implemented mount context handling for procfs and nfs.
> 
> Further developments:
> 
>  (*) Implement mount context support in more filesystems, ext4 being next
>      on my list.
> 
>  (*) Move the walk-from-root stuff that nfs has to generic code so that you
>      can do something akin to:
> 
> 	mount /dev/sda1:/foo/bar /mnt
> 
>      See nfs_follow_remote_path() and mount_subtree().  This is slightly
>      tricky in NFS as we have to prevent referral loops.
> 

':' is a legitimate character in a path component. How will you
distinguish that case?

>  (*) Move the pid_ns pointer from struct mount_context to struct
>      proc_mount_context as I'm not sure it's necessary for anything other
>      than procfs.
> 


>  (*) Work out how to get at the error message incurred by submounts
>      encountered during nfs_follow_remote_path().
> 
>      Should the error message be moved to task_struct and made more
>      general, perhaps retrieved with a prctl() function?
> 

Now that's an interesting idea.

>  (*) Clean up/consolidate the security functions.  Possibly add a
>      validation hook to be called at the same time as the mount context
>      validate op.
> 
> The patches can be found here also:
> 
> 	http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=mount-context
> 
> David
> ---
> David Howells (9):
>       Provide a function to create a NUL-terminated string from unterminated data
>       Clean up whitespace in fs/namespace.c
>       VFS: Introduce a mount context
>       Implement fsopen() to prepare for a mount
>       Implement fsmount() to effect a pre-configured mount
>       Sample program for driving fsopen/fsmount
>       procfs: Move proc_fill_super() to fs/proc/root.c
>       proc: Support the mount context in procfs
>       NFS: Support the mount context and fsopen()
> 
> 
>  Documentation/filesystems/mounting.txt |  445 ++++++++
>  arch/x86/entry/syscalls/syscall_32.tbl |    2 
>  arch/x86/entry/syscalls/syscall_64.tbl |    2 
>  fs/Makefile                            |    3 
>  fs/fsopen.c                            |  295 +++++
>  fs/internal.h                          |    2 
>  fs/mount.h                             |    3 
>  fs/mount_context.c                     |  343 ++++++
>  fs/namespace.c                         |  367 ++++++-
>  fs/nfs/Makefile                        |    2 
>  fs/nfs/client.c                        |   18 
>  fs/nfs/internal.h                      |  127 +-
>  fs/nfs/mount.c                         | 1539 ++++++++++++++++++++++++++++
>  fs/nfs/namespace.c                     |   75 +
>  fs/nfs/nfs3_fs.h                       |    2 
>  fs/nfs/nfs3client.c                    |    6 
>  fs/nfs/nfs3proc.c                      |    1 
>  fs/nfs/nfs4_fs.h                       |    4 
>  fs/nfs/nfs4client.c                    |   80 +
>  fs/nfs/nfs4namespace.c                 |  207 ++--
>  fs/nfs/nfs4proc.c                      |    1 
>  fs/nfs/nfs4super.c                     |  184 ++-
>  fs/nfs/proc.c                          |    1 
>  fs/nfs/super.c                         | 1729 ++------------------------------
>  fs/proc/inode.c                        |   50 -
>  fs/proc/internal.h                     |    6 
>  fs/proc/root.c                         |  194 +++-
>  fs/super.c                             |   50 +
>  include/linux/fs.h                     |   11 
>  include/linux/lsm_hooks.h              |   43 +
>  include/linux/mount.h                  |   67 +
>  include/linux/nfs_xdr.h                |    7 
>  include/linux/security.h               |   35 +
>  include/linux/string.h                 |    1 
>  include/linux/syscalls.h               |    2 
>  include/uapi/linux/magic.h             |    1 
>  kernel/sys_ni.c                        |    4 
>  mm/util.c                              |   22 
>  samples/fsmount/test-fsmount.c         |   79 +
>  security/security.c                    |   39 +
>  security/selinux/hooks.c               |  192 ++++
>  41 files changed, 4148 insertions(+), 2093 deletions(-)
>  create mode 100644 Documentation/filesystems/mounting.txt
>  create mode 100644 fs/fsopen.c
>  create mode 100644 fs/mount_context.c
>  create mode 100644 fs/nfs/mount.c
>  create mode 100644 samples/fsmount/test-fsmount.c
> 

-- 
Jeff Layton <jlayton@poochiereds.net>

* Re: [RFC][PATCH 0/9] VFS: Introduce mount context
  2017-05-03 16:04 [RFC][PATCH 0/9] VFS: Introduce mount context David Howells
                   ` (9 preceding siblings ...)
  2017-05-03 16:44 ` [RFC][PATCH 0/9] VFS: Introduce mount context Jeff Layton
@ 2017-05-03 16:50 ` David Howells
  2017-05-03 17:27   ` Jeff Layton
  2017-05-05 14:35 ` Miklos Szeredi
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 66+ messages in thread
From: David Howells @ 2017-05-03 16:50 UTC (permalink / raw)
  To: Jeff Layton
  Cc: dhowells, viro, linux-fsdevel, linux-nfs, linux-kernel, mszeredi

Jeff Layton <jlayton@poochiereds.net> wrote:

> >  (*) Move the walk-from-root stuff that nfs has to generic code so that you
> >      can do something akin to:
> > 
> > 	mount /dev/sda1:/foo/bar /mnt
> > 
> >      See nfs_follow_remote_path() and mount_subtree().  This is slightly
> >      tricky in NFS as we have to prevent referral loops.
> > 
> 
> ':' is a legitimate character in a path component. How will you
> distinguish that case?

Fair point.  Could instead do something like:

	mount /dev/sda1 /mnt -o subroot=/foo/bar

or just limit it to the fsopen interface.
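
To make the difference concrete, here is a minimal userspace sketch, not code from this series.  Both helper names are hypothetical.  It shows why splitting "device:path" at the first ':' is ambiguous (device names such as those under /dev/disk/by-path contain colons) and how a dedicated "subroot=" option sidesteps that:

```c
#include <string.h>

/* Naive "device:path" split: return the offset of the first ':' in spec,
 * or -1 if there is none.  A colon inside the device name is
 * indistinguishable from the separator. */
static long split_at_colon(const char *spec)
{
	const char *c = strchr(spec, ':');

	return c ? (long)(c - spec) : -1;
}

/* Hypothetical option parser: if opt is "subroot=<absolute path>", return a
 * pointer to the path inside opt; otherwise return NULL. */
static const char *parse_subroot(const char *opt)
{
	static const char key[] = "subroot=";
	size_t klen = sizeof(key) - 1;

	if (strncmp(opt, key, klen) != 0)
		return NULL;
	if (opt[klen] != '/')	/* require a path rooted in the filesystem */
		return NULL;
	return opt + klen;
}
```

With the option form the device string is never inspected for a separator, so any character remains legal in it.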

David

* Re: [PATCH 1/9] Provide a function to create a NUL-terminated string from unterminated data
  2017-05-03 16:04 ` [PATCH 1/9] Provide a function to create a NUL-terminated string from unterminated data David Howells
@ 2017-05-03 16:55   ` Jeff Layton
  2017-05-03 19:26   ` Rasmus Villemoes
  2017-05-03 20:13   ` David Howells
  2 siblings, 0 replies; 66+ messages in thread
From: Jeff Layton @ 2017-05-03 16:55 UTC (permalink / raw)
  To: David Howells, viro; +Cc: linux-fsdevel, linux-nfs, linux-kernel, mszeredi

On Wed, 2017-05-03 at 17:04 +0100, David Howells wrote:
> Provide a function, kstrcreate(), that will create a NUL-terminated string
> from an unterminated character array where the length is known in advance.
> 
> This is better than kstrndup() in situations where we already know the
> string length as the strnlen() in kstrndup() is superfluous.
> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> ---
> 
>  include/linux/string.h |    1 +
>  mm/util.c              |   22 ++++++++++++++++++++++
>  2 files changed, 23 insertions(+)
> 
> diff --git a/include/linux/string.h b/include/linux/string.h
> index 26b6f6a66f83..5596ae56ce0a 100644
> --- a/include/linux/string.h
> +++ b/include/linux/string.h
> @@ -122,6 +122,7 @@ extern void kfree_const(const void *x);
>  extern char *kstrdup(const char *s, gfp_t gfp) __malloc;
>  extern const char *kstrdup_const(const char *s, gfp_t gfp);
>  extern char *kstrndup(const char *s, size_t len, gfp_t gfp);
> +extern char *kstrcreate(const char *s, size_t len, gfp_t gfp);
>  extern void *kmemdup(const void *src, size_t len, gfp_t gfp);
>  
>  extern char **argv_split(gfp_t gfp, const char *str, int *argcp);
> diff --git a/mm/util.c b/mm/util.c
> index 656dc5e37a87..01887bbdb11e 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -103,6 +103,28 @@ char *kstrndup(const char *s, size_t max, gfp_t gfp)
>  EXPORT_SYMBOL(kstrndup);
>  
>  /**
> + * kstrcreate - Create a NUL-terminated string from unterminated data
> + * @s: The data to stringify
> + * @len: The size of the data
> + * @gfp: the GFP mask used in the kmalloc() call when allocating memory
> + */
> +char *kstrcreate(const char *s, size_t len, gfp_t gfp)
> +{
> +	char *buf;
> +
> +	if (!s)
> +		return NULL;
> +
> +	buf = kmalloc_track_caller(len + 1, gfp);
> +	if (buf) {
> +		memcpy(buf, s, len);
> +		buf[len] = '\0';
> +	}
> +	return buf;
> +}
> +EXPORT_SYMBOL(kstrcreate);
> +
> +/**
>   * kmemdup - duplicate region of memory
>   *
>   * @src: memory region to duplicate
> 
> 

I haven't gotten to the part where this gets used yet, but it looks like
a nice helper.
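
For anyone wanting to try the semantics outside the kernel, a userspace analogue might look like the sketch below.  It assumes malloc() in place of kmalloc_track_caller(), and strcreate() is a hypothetical name.  The point is that exactly len bytes are copied with no strnlen() scan, so the caller must guarantee that len bytes are readable:

```c
#include <stdlib.h>
#include <string.h>

/* Userspace stand-in for the proposed kstrcreate(): build a NUL-terminated
 * copy of the first len bytes of s.  Unlike a strndup()-style helper, the
 * source is not scanned for an earlier NUL. */
static char *strcreate(const char *s, size_t len)
{
	char *buf;

	if (!s)
		return NULL;

	buf = malloc(len + 1);
	if (buf) {
		memcpy(buf, s, len);	/* len is trusted, not re-measured */
		buf[len] = '\0';
	}
	return buf;
}
```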

Reviewed-by: Jeff Layton <jlayton@poochiereds.net>

* Re: [RFC][PATCH 0/9] VFS: Introduce mount context
  2017-05-03 16:50 ` David Howells
@ 2017-05-03 17:27   ` Jeff Layton
  0 siblings, 0 replies; 66+ messages in thread
From: Jeff Layton @ 2017-05-03 17:27 UTC (permalink / raw)
  To: David Howells; +Cc: viro, linux-fsdevel, linux-nfs, linux-kernel, mszeredi

On Wed, 2017-05-03 at 17:50 +0100, David Howells wrote:
> Jeff Layton <jlayton@poochiereds.net> wrote:
> 
> > >  (*) Move the walk-from-root stuff that nfs has to generic code so that you
> > >      can do something akin to:
> > > 
> > > 	mount /dev/sda1:/foo/bar /mnt
> > > 
> > >      See nfs_follow_remote_path() and mount_subtree().  This is slightly
> > >      tricky in NFS as we have to prevent referral loops.
> > > 
> > 
> > ':' is a legitimate character in a path component. How will you
> > distinguish that case?
> 
> Fair point.  Could instead do something like:
> 
> 	mount /dev/sda1 /mnt -o subroot=/foo/bar
> 
> or just limit it to the fsopen interface.
> 
> 

Yeah, something like that would certainly work. I like the basic idea
though of combining a mount and bind mount for local filesystems.
-- 
Jeff Layton <jlayton@poochiereds.net>

* Re: [PATCH 3/9] VFS: Introduce a mount context
  2017-05-03 16:04 ` [PATCH 3/9] VFS: Introduce a mount context David Howells
@ 2017-05-03 18:13   ` Jeff Layton
  2017-05-03 18:26     ` Joe Perches
                       ` (2 more replies)
  2017-05-03 21:43   ` Rasmus Villemoes
                     ` (3 subsequent siblings)
  4 siblings, 3 replies; 66+ messages in thread
From: Jeff Layton @ 2017-05-03 18:13 UTC (permalink / raw)
  To: David Howells, viro; +Cc: linux-fsdevel, linux-nfs, linux-kernel, mszeredi

On Wed, 2017-05-03 at 17:04 +0100, David Howells wrote:
> Introduce a mount context concept.  This is allocated at the beginning of
> the mount procedure and into it is placed:
> 
>  (1) Filesystem type.
> 
>  (2) Namespaces.
> 
>  (3) Device name.
> 
>  (4) Superblock flags (MS_*) and mount flags (MNT_*).
> 
>  (5) Security details.
> 
>  (6) Filesystem-specific data, as set by the mount options.
> 
> It also gives a place in which to hang an error message for later retrieval
> (see the mount-by-fd syscall later in this series).
> 
> Rather than calling fs_type->mount(), a mount_context struct is created and
> fs_type->fsopen() is called to set it up.  fs_type->mc_size says how much
> should be added on to the mount context for the filesystem's use.
> 
> A set of operations have to be set by ->fsopen() to provide freeing,
> duplication, option parsing, binary data parsing, validation, mounting and
> superblock filling.
> 
> It should be noted that, whilst this patch adds a lot of lines of code,
> there is quite a bit of duplication with existing code that can be
> eliminated should all filesystems be converted over.
> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> ---
> 
>  Documentation/filesystems/mounting.txt |  445 ++++++++++++++++++++++++++++++++
>  fs/Makefile                            |    3 
>  fs/internal.h                          |    2 
>  fs/mount.h                             |    3 
>  fs/mount_context.c                     |  343 +++++++++++++++++++++++++
>  fs/namespace.c                         |  270 +++++++++++++++++--
>  fs/super.c                             |   50 +++-
>  include/linux/fs.h                     |   11 +
>  include/linux/lsm_hooks.h              |   37 +++
>  include/linux/mount.h                  |   67 +++++
>  include/linux/security.h               |   29 ++
>  security/security.c                    |   32 ++
>  security/selinux/hooks.c               |  179 +++++++++++++
>  13 files changed, 1435 insertions(+), 36 deletions(-)
>  create mode 100644 Documentation/filesystems/mounting.txt
>  create mode 100644 fs/mount_context.c
> 
> diff --git a/Documentation/filesystems/mounting.txt b/Documentation/filesystems/mounting.txt
> new file mode 100644
> index 000000000000..a942ccd08376
> --- /dev/null
> +++ b/Documentation/filesystems/mounting.txt
> @@ -0,0 +1,445 @@
> +			      ===================
> +			      FILESYSTEM MOUNTING
> +			      ===================
> +
> +CONTENTS
> +
> + (1) Overview.
> +
> + (2) The mount context.
> +
> + (3) The mount context operations.
> +
> + (4) Mount context security.
> +
> + (5) VFS mount context operations.
> +
> +
> +========
> +OVERVIEW
> +========
> +
> +The creation of new mounts is now to be done in a multistep process:
> +
> + (1) Create a mount context.
> +
> + (2) Parse the options and attach them to the mount context.  Options may be
> +     passed individually from userspace.
> +
> + (3) Validate and pre-process the mount context.
> +
> + (4) Perform the mount.
> +
> + (5) Return an error message attached to the mount context.
> +
> + (6) Destroy the mount context.
> +
> +To support this, the file_system_type struct gains two new fields:
> +
> +	unsigned short mc_size;
> +
> +which indicates how much space the filesystem would like tacked onto the end of
> +the mount_context struct for its own purposes, and:
> +
> +	int (*fsopen)(struct mount_context *mc, struct super_block *src_sb);
> +
> +which is invoked to set up the filesystem-specific parts of a mount context,
> +including the additional space.  The src_sb parameter is used to convey the
> +superblock from which the filesystem may draw extra information (such as
> +namespaces), for submount (MS_SUBMOUNT) or remount (MS_REMOUNT) purposes or it
> +will be NULL.
> +
> +Note that security initialisation is done *after* the filesystem is called so
> +that the namespaces may be adjusted first.
> +
> +And the super_operations struct gains one:
> +
> +	int (*remount_fs_mc) (struct super_block *, struct mount_context *);
> +
> +This shadows the ->remount_fs() operation and takes a prepared mount context
> +instead of the mount flags and data page.  It may modify the ms_flags in the
> +context for the caller to pick up.
> +
> +[NOTE] remount_fs_mc is intended as a replacement for remount_fs.
> +
> +
> +=================
> +THE MOUNT CONTEXT
> +=================
> +
> +The mount process is governed by a mount context.  This is represented by the
> +mount_context structure:
> +
> +	struct mount_context {
> +		const struct mount_context_operations *ops;
> +		struct file_system_type *fs;
> +		struct user_namespace	*user_ns;
> +		struct mnt_namespace	*mnt_ns;
> +		struct pid_namespace	*pid_ns;
> +		struct net		*net_ns;
> +		const struct cred	*cred;
> +		char			*device;
> +		char			*root_path;
> +		void			*security;
> +		const char		*error;
> +		unsigned int		ms_flags;
> +		unsigned int		mnt_flags;
> +		bool			mounted;
> +		bool			sloppy;
> +		bool			silent;
> +		enum mount_type		mount_type : 8;
> +	};
> +
> +When allocated, the mount_context struct is extended by ->mc_size bytes as
> +specified by the relevant file_system_type struct.  This is for use by the
> +filesystem.  The filesystem should wrap the struct in its own, e.g.:
> +
> +	struct nfs_mount_context {
> +		struct mount_context mc;
> +		...
> +	};
> +
> +placing the mount_context struct first.  container_of() can then be used.
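
The wrapping pattern described here can be sketched in userspace as follows.  This is an illustration only: struct mount_context is cut down to a single field, container_of() is re-created with offsetof(), and example_mount_context, its timeout field and to_example() are all hypothetical names:

```c
#include <stddef.h>

/* Userspace stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Cut-down stand-in for the generic context; only one field is shown. */
struct mount_context {
	unsigned int ms_flags;
};

/* Hypothetical filesystem wrapper with the generic part placed first. */
struct example_mount_context {
	struct mount_context mc;
	unsigned int timeout;	/* illustrative filesystem-private field */
};

/* Recover the filesystem's wrapper from the generic pointer the VFS passes. */
static struct example_mount_context *to_example(struct mount_context *mc)
{
	return container_of(mc, struct example_mount_context, mc);
}
```

Because the generic struct comes first, the wrapper and the embedded mount_context share an address, which is what lets the VFS allocate sizeof(struct mount_context) + mc_size in one go.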
> +
> +The mount_context fields are as follows:
> +
> + (*) const struct mount_context_operations *ops
> +
> +     These are operations that can be done on a mount context.  See below.
> +     This must be set by the ->fsopen() file_system_type operation.
> +
> + (*) struct file_system_type *fs
> +
> +     A pointer to the file_system_type of the filesystem that is being
> +     mounted.  This retains a ref on the type owner.
> +
> + (*) struct user_namespace *user_ns
> + (*) struct mnt_namespace *mnt_ns
> + (*) struct pid_namespace *pid_ns
> + (*) struct net *net_ns
> +
> +     This is a subset of the namespaces in use by the invoking process.  This
> +     retains a ref on each namespace.  The subscribed namespaces may be
> +     replaced by the filesystem to reflect other sources, such as the parent
> +     mount superblock on an automount.
> +
> + (*) struct cred *cred
> +
> +     The mounter's credentials.  This retains a ref on the credentials.
> +
> + (*) char *device
> +
> +     This is the device to be mounted.  It may be a block device
> +     (e.g. /dev/sda1) or something more exotic, such as the "host:/path" that
> +     NFS desires.
> +
> + (*) char *root_path
> +
> +     A path to the place inside the filesystem to actually mount.  This allows
> +     a mount and bind-mount to be combined.
> +
> +     [NOTE] This isn't implemented yet, but NFS has the code to do this which
> +     could be moved to the VFS.
> +
> + (*) void *security
> +
> +     A place for the LSMs to hang their security data for the mount.  The
> +     relevant security operations are described below.
> +
> + (*) const char *error
> +
> +     A place for the VFS and the filesystem to hang an error message.  This
> +     should be in the form of a static string that doesn't need deallocation
> +     and the pointer to which can just be overwritten.  Under some
> +     circumstances, this can be retrieved by userspace.
> +
> +     Note that the existence of the error string is expected to be guaranteed
> +     by the reference on the file_system_type object held by ->fs or any
> +     filesystem-specific reference held in the filesystem context until the
> +     ->free() operation is called.
> +
> + (*) unsigned int ms_flags
> + (*) unsigned int mnt_flags
> +
> +     These hold the mount flags.  ms_flags holds MS_* flags and mnt_flags holds
> +     MNT_* flags.
> +
> + (*) bool mounted
> +
> +     This is set to true once a mount attempt is made.  This causes an error to
> +     be given on subsequent mount attempts with the same context and prevents
> +     multiple mount attempts.
> +
> + (*) bool sloppy
> + (*) bool silent
> +
> +     These are set if the sloppy or silent mount options are given.
> +
> +     [NOTE] sloppy is probably unnecessary when userspace passes over one
> +     option at a time since the error can just be ignored if userspace deems it
> +     to be unimportant.
> +
> +     [NOTE] silent is probably redundant with ms_flags & MS_SILENT.
> +
> + (*) enum mount_type
> +
> +     This indicates the type of mount operation.  The available values are:
> +
> +	MOUNT_TYPE_NEW		-- New mount
> +	MOUNT_TYPE_SUBMOUNT	-- New automatic submount of extant mount
> +	MOUNT_TYPE_REMOUNT	-- Change an existing mount
> +
> +The mount context is created by calling __vfs_fsopen(), vfs_fsopen(),
> +vfs_mntopen() or vfs_dup_mount_context() and is destroyed with
> +put_mount_context().  Note that the structure is not refcounted.
> +
> +VFS, security and filesystem mount options are set individually with
> +vfs_mount_option() or in bulk with generic_monolithic_mount_data().
> +
> +When mounting, the filesystem is allowed to take data from any of the pointers
> +and attach it to the superblock (or whatever), provided it clears the pointer
> +in the mount context.
> +
> +The filesystem is also allowed to allocate resources and pin them with the
> +mount context.  For instance, NFS might pin the appropriate protocol version
> +module.
> +
> +
> +============================
> +THE MOUNT CONTEXT OPERATIONS
> +============================
> +
> +The mount context points to a table of operations:
> +
> +	struct mount_context_operations {
> +		void (*free)(struct mount_context *mc);
> +		int (*dup)(struct mount_context *mc, struct mount_context *src);
> +		int (*option)(struct mount_context *mc, char *p);
> +		int (*monolithic_mount_data)(struct mount_context *mc, void *data);
> +		int (*validate)(struct mount_context *mc);
> +		struct dentry *(*mount)(struct mount_context *mc);
> +		int (*fill_super)(struct super_block *s, struct mount_context *mc);
> +	};
> +
> +These operations are invoked by the various stages of the mount procedure to
> +manage the mount context.  They are as follows:
> +
> + (*) void (*free)(struct mount_context *mc);
> +
> +     Called to clean up the filesystem-specific part of the mount context when
> +     the context is destroyed.  It should be aware that parts of the context
> +     may have been removed and NULL'd out by ->mount().
> +
> + (*) int (*dup)(struct mount_context *mc, struct mount_context *src);
> +
> +     Called when a mount context has been duplicated to get any refs or copy
> +     any non-referenced resources held in the filesystem-specific part of the
> +     mount context.  An error may be returned to indicate failure to do this.
> +
> +     [!] Note that if this fails, put_mount_context() will be called
> +     	 immediately thereafter, so ->dup() *must* make the
> +     	 filesystem-specific part safe for ->free().
> +
> + (*) int (*option)(struct mount_context *mc, char *p);
> +
> +     Called when an option is to be added to the mount context.  p points to
> +     the option string, likely in "key[=val]" format.  VFS-specific options
> +     will have been weeded out and mc->ms_flags and mc->mnt_flags updated in
> +     the context.  Security options will also have been weeded out and
> +     mc->security updated.
> +
> +     If successful, 0 should be returned and a negative error code otherwise.
> +     If an ambiguous error (such as -EINVAL) is returned, mc->error should be
> +     set in the context to a string that provides more information.
> +
> + (*) int (*monolithic_mount_data)(struct mount_context *mc, void *data);
> +
> +     Called when the mount(2) system call is invoked to pass the entire data
> +     page in one go.  If this is expected to be just a list of "key[=val]"
> +     items separated by commas, then this may be set to NULL.
> +
> +     The return value is as for ->option().
> +
> +     If the filesystem (eg. NFS) needs to examine the data first and then
> +     finds it's the standard key-val list then it may pass it off to:
> +
> +	int generic_monolithic_mount_data(struct mount_context *mc, void *data);
> +
> + (*) int (*validate)(struct mount_context *mc);
> +
> +     Called when all the options have been applied and the mount is about to
> +     take place.  It should check for inconsistencies from mount options
> +     and it is also allowed to do preliminary resource acquisition.  For
> +     instance, the core NFS module could load the NFS protocol module here.
> +
> +     Note that if mc->mount_type == MOUNT_TYPE_REMOUNT, some of the options
> +     necessary for a new mount may not be set.
> +
> +     The return value is as for ->option().
> +
> + (*) struct dentry *(*mount)(struct mount_context *mc);
> +
> +     Called to effect a new mount or new submount using the information stored
> +     in the mount context (remounts go via a different vector).  It may detach
> +     any resources it desires from the mount context and transfer them to the
> +     superblock it creates.
> +
> +     On success it should return the dentry that's at the root of the mount.
> +     In future, mc->root_path will then be applied to this.
> +
> +     In the case of an error, it should return a negative error code and set
> +     mc->error.
> +
> + (*) int (*fill_super)(struct super_block *s, struct mount_context *mc);
> +
> +     This is available to be used by things like mount_ns_mc() that are called
> +     by ->mount() to transfer information/resources from the mount context to
> +     the superblock.
> +
> +
> +======================
> +MOUNT CONTEXT SECURITY
> +======================
> +
> +The mount context contains a security pointer that the LSMs can use for
> +building up a security context for the superblock to be mounted.  There are a
> +number of operations used by the new mount code for this purpose:
> +
> + (*) int security_mount_ctx_alloc(struct mount_context *mc,
> +				  struct super_block *src_sb);
> +
> +     Called to initialise mc->security (which is preset to NULL) and allocate
> +     any resources needed.  It should return 0 on success and a negative error
> +     code on failure.
> +
> +     src_sb is non-NULL in the case of a remount (MS_REMOUNT) in which case it
> +     indicates the superblock to be remounted or in the case of a submount
> +     (MS_SUBMOUNT) in which case it indicates the parent superblock.
> +
> + (*) int security_mount_ctx_dup(struct mount_context *mc,
> +				struct mount_context *src_mc);
> +
> +     Called to initialise mc->security (which is preset to NULL) and allocate
> +     any resources needed.  The original mount context is pointed to by src_mc
> +     and may be used for reference.  It should return 0 on success and a
> +     negative error code on failure.
> +
> + (*) void security_mount_ctx_free(struct mount_context *mc);
> +
> +     Called to clean up anything attached to mc->security.  Note that the
> +     contents may have been transferred to a superblock and the pointer NULL'd
> +     out during mount.
> +
> + (*) int security_mount_ctx_option(struct mount_context *mc, char *opt);
> +
> +     Called for each mount option.  The mount options are in "key[=val]"
> +     form.  An active LSM may reject one with an error, pass one over and
> +     return 0 or consume one and return 1.  If consumed, the option isn't
> +     passed on to the filesystem.
> +
> +     If it returns an error, it should set mc->error if the error is
> +     ambiguous.
> +
> + (*) int security_mount_ctx_kern_mount(struct mount_context *mc,
> +				       struct super_block *sb);
> +
> +     Called during mount to verify that the specified superblock is allowed to
> +     be mounted and to transfer the security data there.
> +
> +     On success, it should return 0; otherwise it should return an error and
> +     set mc->error to indicate the problem.  It should not return -ENOMEM as
> +     this should be taken care of in advance.
> +
> +     [NOTE] Should I add a security_mount_ctx_validate() operation so that the
> +     LSM has the opportunity to allocate stuff and check the options as a
> +     whole?
> +
> +
> +============================
> +VFS MOUNT CONTEXT OPERATIONS
> +============================
> +
> +There are four operations for creating a mount context and one for destroying
> +a context:
> +
> + (*) struct mount_context *__vfs_fsopen(struct file_system_type *fs_type,
> +					struct super_block *src_sb,
> +					unsigned int ms_flags,
> +					unsigned int mnt_flags);
> +
> +     Create a mount context given a filesystem type pointer.  This allocates
> +     the mount context, sets the flags, initialises the security and calls
> +     fs_type->fsopen() to initialise the filesystem context.
> +
> +     src_sb can be NULL or it may indicate a superblock that is going to be
> +     remounted (MS_REMOUNT) or a superblock that is the parent of a submount
> +     (MS_SUBMOUNT).  This superblock is provided as a source of namespace
> +     information.
> +
> + (*) struct mount_context *vfs_mntopen(struct vfsmount *mnt,
> +				       unsigned int ms_flags,
> +				       unsigned int mnt_flags);
> +
> +     Create a mount context from the same filesystem as an extant mount and
> +     initialise the mount parameters from the superblock underlying that
> +     mount.  This is used by remount.
> +
> + (*) struct mount_context *vfs_fsopen(const char *fs_name);
> +
> +     Create a mount context given a filesystem name.  It is assumed that the
> +     mount flags will be passed in as text options later.  This is intended to
> +     be called from sys_fsopen().  This copies current's namespaces to the
> +     mount context.
> +
> + (*) struct mount_context *vfs_dup_mount_context(struct mount_context *src);
> +
> +     Duplicate a mount context, copying any options noted and duplicating or
> +     additionally referencing any resources held therein.  This is available
> +     for use where a filesystem has to get a mount within a mount, such as
> +     NFS4 does by internally mounting the root of the target server and then
> +     doing a private pathwalk to the target directory.
> +
> + (*) void put_mount_context(struct mount_context *ctx);
> +
> +     Destroy a mount context, releasing any resources it holds.  This calls
> +     the ->free() operation.  This is intended to be called by anyone who
> +     created a mount context.
> +
> +     [!] Mount contexts are not refcounted, so this causes unconditional
> +     	 destruction.
> +
> +In all the above operations, apart from the put op, the return is a mount
> +context pointer or a negative error code.  No error string is saved as the
> +error string is only guaranteed as long as the file_system_type is pinned (and
> +thus the module).
> +
> +In the remaining operations, if an error occurs, a negative error code is
> +returned and, if not obvious, mc->error should be set to point to a useful
> +string.  The string should not be freed.
> +
> + (*) struct vfsmount *vfs_kern_mount_mc(struct mount_context *mc);
> +
> +     Create a mount given the parameters in the specified mount context.  This
> +     invokes the ->validate() op and then the ->mount() op.
> +
> + (*) struct vfsmount *vfs_submount_mc(const struct dentry *mountpoint,
> +				      struct mount_context *mc);
> +
> +     Create a mount given a mount context and set MS_SUBMOUNT on it.  A
> +     wrapper around vfs_kern_mount_mc().  This is intended to be called from
> +     filesystems that have automount points (NFS, AFS, ...).
> +
> + (*) int vfs_mount_option(struct mount_context *mc, char *data);
> +
> +     Supply a single mount option to the mount context.  The mount option
> +     should likely be in a "key[=val]" string form.  The option is first
> +     checked to see if it corresponds to a standard mount flag (in which case
> +     it is used to mark an MS_xxx flag and consumed) or a security option (in
> +     which case the LSM consumes it) before it is passed on to the filesystem.
> +
> + (*) int generic_monolithic_mount_data(struct mount_context *ctx, void *data);
> +
> +     Parse a sys_mount() data page, assuming the form to be a text list
> +     consisting of key[=val] options separated by commas.  Each item in the
> +     list is passed to vfs_mount_option().  This is the default when the
> +     ->monolithic_mount_data() operation is NULL.
> diff --git a/fs/Makefile b/fs/Makefile
> index 7bbaca9c67b1..308a104a9a07 100644
> --- a/fs/Makefile
> +++ b/fs/Makefile
> @@ -11,7 +11,8 @@ obj-y :=	open.o read_write.o file_table.o super.o \
>  		attr.o bad_inode.o file.o filesystems.o namespace.o \
>  		seq_file.o xattr.o libfs.o fs-writeback.o \
>  		pnode.o splice.o sync.o utimes.o \
> -		stack.o fs_struct.o statfs.o fs_pin.o nsfs.o
> +		stack.o fs_struct.o statfs.o fs_pin.o nsfs.o \
> +		mount_context.o
>  
>  ifeq ($(CONFIG_BLOCK),y)
>  obj-y +=	buffer.o block_dev.o direct-io.o mpage.o
> diff --git a/fs/internal.h b/fs/internal.h
> index 076751d90ba2..ef8c5e93f364 100644
> --- a/fs/internal.h
> +++ b/fs/internal.h
> @@ -87,7 +87,7 @@ extern struct file *get_empty_filp(void);
>  /*
>   * super.c
>   */
> -extern int do_remount_sb(struct super_block *, int, void *, int);
> +extern int do_remount_sb(struct super_block *, int, void *, int, struct mount_context *);
>  extern bool trylock_super(struct super_block *sb);
>  extern struct dentry *mount_fs(struct file_system_type *,
>  			       int, const char *, void *);
> diff --git a/fs/mount.h b/fs/mount.h
> index 2826543a131d..b1e99b38f2ee 100644
> --- a/fs/mount.h
> +++ b/fs/mount.h
> @@ -108,9 +108,10 @@ static inline void detach_mounts(struct dentry *dentry)
>  	__detach_mounts(dentry);
>  }
>  
> -static inline void get_mnt_ns(struct mnt_namespace *ns)
> +static inline struct mnt_namespace *get_mnt_ns(struct mnt_namespace *ns)
>  {
>  	atomic_inc(&ns->count);
> +	return ns;
>  }
>  
>  extern seqlock_t mount_lock;
> diff --git a/fs/mount_context.c b/fs/mount_context.c
> new file mode 100644
> index 000000000000..7d765c100bf1
> --- /dev/null
> +++ b/fs/mount_context.c
> @@ -0,0 +1,343 @@
> +/* Provide a way to create a mount context within the kernel that can be
> + * configured before mounting.
> + *
> + * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
> + * Written by David Howells (dhowells@redhat.com)
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public Licence
> + * as published by the Free Software Foundation; either version
> + * 2 of the Licence, or (at your option) any later version.
> + */
> +
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +#include <linux/fs.h>
> +#include <linux/mount.h>
> +#include <linux/nsproxy.h>
> +#include <linux/slab.h>
> +#include <linux/magic.h>
> +#include <linux/security.h>
> +#include <linux/parser.h>
> +#include <linux/mnt_namespace.h>
> +#include <linux/pid_namespace.h>
> +#include <linux/user_namespace.h>
> +#include <net/net_namespace.h>
> +#include "mount.h"
> +
> +static const match_table_t common_set_mount_options = {
> +	{ MS_DIRSYNC,		"dirsync" },
> +	{ MS_I_VERSION,		"iversion" },
> +	{ MS_LAZYTIME,		"lazytime" },
> +	{ MS_MANDLOCK,		"mand" },
> +	{ MS_NOATIME,		"noatime" },
> +	{ MS_NODEV,		"nodev" },
> +	{ MS_NODIRATIME,	"nodiratime" },
> +	{ MS_NOEXEC,		"noexec" },
> +	{ MS_NOSUID,		"nosuid" },
> +	{ MS_POSIXACL,		"posixacl" },
> +	{ MS_RDONLY,		"ro" },
> +	{ MS_REC,		"rec" },
> +	{ MS_RELATIME,		"relatime" },
> +	{ MS_STRICTATIME,	"strictatime" },
> +	{ MS_SYNCHRONOUS,	"sync" },
> +	{ MS_VERBOSE,		"verbose" },
> +	{ },
> +};
> +
> +static const match_table_t common_clear_mount_options = {
> +	{ MS_LAZYTIME,		"nolazytime" },
> +	{ MS_MANDLOCK,		"nomand" },
> +	{ MS_NODEV,		"dev" },
> +	{ MS_NOEXEC,		"exec" },
> +	{ MS_NOSUID,		"suid" },
> +	{ MS_RDONLY,		"rw" },
> +	{ MS_RELATIME,		"norelatime" },
> +	{ MS_SILENT,		"silent" },
> +	{ MS_STRICTATIME,	"nostrictatime" },
> +	{ MS_SYNCHRONOUS,	"async" },
> +	{ },
> +};
> +
> +static const match_table_t forbidden_mount_options = {
> +	{ MS_BIND,		"bind" },
> +	{ MS_KERNMOUNT,		"ro" },
> +	{ MS_MOVE,		"move" },
> +	{ MS_PRIVATE,		"private" },
> +	{ MS_REMOUNT,		"remount" },
> +	{ MS_SHARED,		"shared" },
> +	{ MS_SLAVE,		"slave" },
> +	{ MS_UNBINDABLE,	"unbindable" },
> +	{ },
> +};
> +
> +/*
> + * Check for a common mount option.
> + */
> +static noinline int vfs_common_mount_option(struct mount_context *mc, char *data)
> +{
> +	substring_t args[MAX_OPT_ARGS];
> +	unsigned int token;
> +
> +	token = match_token(data, common_set_mount_options, args);
> +	if (token) {
> +		mc->ms_flags |= token;
> +		return 1;
> +	}
> +
> +	token = match_token(data, common_clear_mount_options, args);
> +	if (token) {
> +		mc->ms_flags &= ~token;
> +		return 1;
> +	}
> +
> +	token = match_token(data, forbidden_mount_options, args);
> +	if (token) {
> +		mc->error = "Mount option, not superblock option";
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
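
To make the lookup order above concrete, here is a minimal userspace sketch of the set/clear/forbidden table scheme.  It is not the kernel code: the FLAG_* values, struct opt_entry, lookup() and common_option() are hypothetical stand-ins for the MS_* constants, match_table_t and match_token(), and it returns -1 where the patch returns -EINVAL.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag values; the kernel's MS_* constants differ. */
#define FLAG_RDONLY	0x01u
#define FLAG_NOSUID	0x02u
#define FLAG_NOEXEC	0x04u
#define FLAG_BIND	0x08u

struct opt_entry {
	unsigned int flag;
	const char *name;	/* NULL name terminates the table */
};

static const struct opt_entry set_opts[] = {
	{ FLAG_RDONLY, "ro" },
	{ FLAG_NOSUID, "nosuid" },
	{ FLAG_NOEXEC, "noexec" },
	{ 0, NULL },
};

static const struct opt_entry clear_opts[] = {
	{ FLAG_RDONLY, "rw" },
	{ FLAG_NOSUID, "suid" },
	{ FLAG_NOEXEC, "exec" },
	{ 0, NULL },
};

static const struct opt_entry forbidden_opts[] = {
	{ FLAG_BIND, "bind" },
	{ 0, NULL },
};

/* Return the flag for an exact name match, or 0 if the name is absent. */
static unsigned int lookup(const struct opt_entry *table, const char *opt)
{
	assert(table != NULL && opt != NULL);
	for (; table->name; table++)
		if (strcmp(table->name, opt) == 0)
			return table->flag;
	return 0;
}

/*
 * Returns 1 if the option was consumed as a common flag, 0 if it should be
 * passed on, and -1 if it names an operation rather than a superblock option.
 */
static int common_option(unsigned int *flags, const char *opt)
{
	unsigned int f;

	f = lookup(set_opts, opt);
	if (f) {
		*flags |= f;
		return 1;
	}

	f = lookup(clear_opts, opt);
	if (f) {
		*flags &= ~f;
		return 1;
	}

	if (lookup(forbidden_opts, opt))
		return -1;

	return 0;
}
```

Because the tables are searched in order, a name that appears in an earlier table shadows the same name in a later one.
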
> +
> +/**
> + * vfs_mount_option - Add a single mount option to a mount context
> + * @mc: The mount context to modify
> + * @data: The option to apply.
> + *
> + * A single mount option in string form is applied to the mount being set up in
> + * the mount context.  Certain standard options (for example "ro") are
> + * translated into flag bits without going to the filesystem.  The active
> + * security module allowed to observe and poach options.  Any other options are

"module is allowed"

> + * passed over to the filesystem to parse.
> + *
> + * This may be called multiple times for a context.
> + *
> + * Returns 0 on success and a negative error code on failure.  In the event of
> + * failure, mc->error may have been set to a non-allocated string that gives
> + * more information.
> + */
> +int vfs_mount_option(struct mount_context *mc, char *data)
> +{
> +	int ret;
> +
> +	if (mc->mounted)
> +		return -EBUSY;
> +
> +	ret = vfs_common_mount_option(mc, data);
> +	if (ret < 0)
> +		return ret;
> +	if (ret == 1)
> +		return 0;
> +
> +	ret = security_mount_ctx_option(mc, data);
> +	if (ret < 0)
> +		return ret;
> +	if (ret == 1)
> +		return 0;
> +
> +	return mc->ops->option(mc, data);
> +}
> +EXPORT_SYMBOL(vfs_mount_option);
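
The return convention used above (1 means consumed, 0 means pass it on, negative means reject) can be sketched in userspace as follows.  Everything here is an assumption made for illustration: struct demo_ctx and the demo_*() handlers are hypothetical, and the option names they claim are arbitrary examples rather than what the VFS, an LSM or a filesystem would actually accept.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct mount_context. */
struct demo_ctx {
	bool mounted;
	int generic_hits;	/* options consumed by the generic layer */
	int lsm_hits;		/* options consumed by the security layer */
	int fs_hits;		/* options that reached the filesystem */
};

/* Generic layer: claims "ro", rejects "bind", passes everything else on. */
static int demo_generic(struct demo_ctx *c, const char *opt)
{
	if (strcmp(opt, "bind") == 0)
		return -EINVAL;
	if (strcmp(opt, "ro") == 0) {
		c->generic_hits++;
		return 1;
	}
	return 0;
}

/* Security layer: claims anything starting with "context=". */
static int demo_lsm(struct demo_ctx *c, const char *opt)
{
	if (strncmp(opt, "context=", 8) == 0) {
		c->lsm_hits++;
		return 1;
	}
	return 0;
}

/* Filesystem layer: last in the chain, its result is returned as-is. */
static int demo_fs(struct demo_ctx *c, const char *opt)
{
	(void)opt;
	c->fs_hits++;
	return 0;
}

static int demo_mount_option(struct demo_ctx *c, const char *opt)
{
	int ret;

	assert(c != NULL && opt != NULL);
	if (c->mounted)
		return -EBUSY;

	ret = demo_generic(c, opt);
	if (ret < 0)
		return ret;
	if (ret == 1)
		return 0;

	ret = demo_lsm(c, opt);
	if (ret < 0)
		return ret;
	if (ret == 1)
		return 0;

	return demo_fs(c, opt);
}
```

The point of the 1/0 split is that an earlier layer can take an option without the filesystem ever seeing it, while the caller still only sees 0 or a negative error.
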
> +
> +/**
> + * generic_monolithic_mount_data - Parse key[=val][,key[=val]]* mount data
> + * @ctx: The mount context to populate
> + * @data: The data to parse
> + *
> + * Parse a blob of data that's in key[=val][,key[=val]]* form.  This can be
> + * called from the ->monolithic_mount_data() mount context operation.
> + *
> + * Returns 0 on success or the error returned by the ->option() mount context
> + * operation on failure.
> + */
> +int generic_monolithic_mount_data(struct mount_context *ctx, void *data)
> +{
> +	char *options = data, *p;
> +	int ret;
> +
> +	if (!options)
> +		return 0;
> +
> +	while ((p = strsep(&options, ",")) != NULL) {
> +		if (*p) {
> +			ret = vfs_mount_option(ctx, p);
> +			if (ret < 0)
> +				return ret;
> +		}
> +	}
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL(generic_monolithic_mount_data);
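
For reference, the splitting loop above behaves like this userspace sketch.  parse_monolithic(), struct collected and collect() are hypothetical names; the callback stands in for vfs_mount_option().  Note that strsep() writes NUL bytes into the buffer, so the data passed in has to be writable, and that empty segments (from ",," or a trailing comma) are skipped rather than reported.

```c
#define _DEFAULT_SOURCE		/* for strsep() on glibc */
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef int (*option_fn)(void *ctx, char *opt);

/* Split "key[=val][,key[=val]]*" in place and hand each option to fn. */
static int parse_monolithic(char *options, option_fn fn, void *ctx)
{
	char *p;
	int ret;

	assert(fn != NULL);
	if (!options)
		return 0;

	while ((p = strsep(&options, ",")) != NULL) {
		if (*p) {
			ret = fn(ctx, p);
			if (ret < 0)
				return ret;	/* stop at the first failure */
		}
	}
	return 0;
}

/* Hypothetical callback that just records the options it is given. */
struct collected {
	int count;
	char *seen[8];
};

static int collect(void *ctx, char *opt)
{
	struct collected *c = ctx;

	if (c->count >= 8)
		return -1;
	c->seen[c->count++] = opt;
	return 0;
}
```

A value that itself contains a comma cannot be expressed in this form, which is presumably one reason the patch lets a filesystem supply its own ->monolithic_mount_data().
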
> +
> +/**
> + * __vfs_fsopen - Open a filesystem and create a mount context
> + * @fs_type: The filesystem type
> + * @src_sb: A superblock from which this one derives (or NULL)
> + * @ms_flags: Superblock flags and op flags (such as MS_REMOUNT)
> + * @mnt_flags: Mountpoint flags, such as MNT_READONLY
> + * @mount_type: Type of mount
> + *
> + * Open a filesystem and create a mount context.  The mount context is
> + * initialised with the supplied flags and, if a submount/automount from
> + * another superblock (@src_sb), may have parameters such as namespaces copied
> + * across from that superblock.
> + */
> +struct mount_context *__vfs_fsopen(struct file_system_type *fs_type,
> +				   struct super_block *src_sb,
> +				   unsigned int ms_flags, unsigned int mnt_flags,
> +				   enum mount_type mount_type)
> +{
> +	struct mount_context *mc;
> +	int ret;
> +
> +	if (fs_type->fsopen && fs_type->mc_size < sizeof(*mc))
> +		BUG();
> +
> +	mc = kzalloc(max_t(size_t, fs_type->mc_size, sizeof(*mc)), GFP_KERNEL);
> +	if (!mc)
> +		return ERR_PTR(-ENOMEM);
> +
> +	mc->mount_type = mount_type;
> +	mc->ms_flags = ms_flags;
> +	mc->mnt_flags = mnt_flags;
> +	mc->fs_type = fs_type;
> +	get_filesystem(fs_type);
> +	mc->mnt_ns = get_mnt_ns(current->nsproxy->mnt_ns);
> +	mc->pid_ns = get_pid_ns(task_active_pid_ns(current));
> +	mc->net_ns = get_net(current->nsproxy->net_ns);
> +	mc->user_ns = get_user_ns(current_user_ns());
> +	mc->cred = get_current_cred();
> +
> +
> +	/* TODO: Make all filesystems support this unconditionally */
> +	if (mc->fs_type->fsopen) {
> +		ret = mc->fs_type->fsopen(mc, src_sb);
> +		if (ret < 0)
> +			goto err_mc;
> +	}
> +
> +	/* Do the security check last because ->fsopen may change the
> +	 * namespace subscriptions.
> +	 */
> +	ret = security_mount_ctx_alloc(mc, src_sb);
> +	if (ret < 0)
> +		goto err_mc;
> +
> +	return mc;
> +
> +err_mc:
> +	put_mount_context(mc);
> +	return ERR_PTR(ret);
> +}
> +EXPORT_SYMBOL(__vfs_fsopen);
> +
> +/**
> + * vfs_fsopen - Open a filesystem and create a mount context
> + * @fs_name: The name of the filesystem
> + *
> + * Open a filesystem and create a mount context that will hold the mount
> + * options, device name, security details, etc.  Note that the caller should
> + * check the ->ops pointer in the returned context to determine whether the
> + * filesystem actually supports the mount context itself.
> + */
> +struct mount_context *vfs_fsopen(const char *fs_name)
> +{
> +	struct file_system_type *fs_type;
> +	struct mount_context *mc;
> +
> +	fs_type = get_fs_type(fs_name);
> +	if (!fs_type)
> +		return ERR_PTR(-ENODEV);
> +
> +	mc = __vfs_fsopen(fs_type, NULL, 0, 0, MOUNT_TYPE_NEW);
> +	put_filesystem(fs_type);
> +	return mc;
> +}
> +EXPORT_SYMBOL(vfs_fsopen);
> +
> +/**
> + * vfs_mntopen - Create a mount context and initialise it from an extant mount
> + * @mnt: The mountpoint to open
> + * @ms_flags: Superblock flags and op flags (such as MS_REMOUNT)
> + * @mnt_flags: Mountpoint flags, such as MNT_READONLY
> + * @mount_type: Type of mount
> + *
> + * Open a mounted filesystem and create a mount context such that a remount can
> + * be effected.
> + */
> +struct mount_context *vfs_mntopen(struct vfsmount *mnt,
> +				  unsigned int ms_flags,
> +				  unsigned int mnt_flags,
> +				  enum mount_type mount_type)
> +{
> +	return __vfs_fsopen(mnt->mnt_sb->s_type, mnt->mnt_sb,
> +			    ms_flags, mnt_flags, mount_type);
> +}
> +
> +/**
> + * vfs_dup_mount_context - Duplicate a mount context.
> + * @src: The mount context to copy.
> + */
> +struct mount_context *vfs_dup_mount_context(struct mount_context *src)
> +{
> +	struct mount_context *mc;
> +	int ret;
> +
> +	if (!src->ops->dup)
> +		return ERR_PTR(-ENOTSUPP);
> +
> +	mc = kmemdup(src, src->fs_type->mc_size, GFP_KERNEL);
> +	if (!mc)
> +		return ERR_PTR(-ENOMEM);
> +
> +	mc->device	= NULL;
> +	mc->root_path	= NULL;
> +	mc->security	= NULL;
> +	mc->error	= NULL;
> +	get_filesystem(mc->fs_type);
> +	get_mnt_ns(mc->mnt_ns);
> +	get_pid_ns(mc->pid_ns);
> +	get_net(mc->net_ns);
> +	get_user_ns(mc->user_ns);
> +	get_cred(mc->cred);
> +
> +	/* Can't call put until we've called ->dup */
> +	ret = mc->ops->dup(mc, src);
> +	if (ret < 0)
> +		goto err_mc;
> +
> +	ret = security_mount_ctx_dup(mc, src);
> +	if (ret < 0)
> +		goto err_mc;
> +	return mc;
> +
> +err_mc:
> +	put_mount_context(mc);
> +	return ERR_PTR(ret);
> +}
> +EXPORT_SYMBOL(vfs_dup_mount_context);
> +
> +/*
> + * Dispose of a mount context.
> + */
> +void put_mount_context(struct mount_context *mc)
> +{
> +	if (mc->ops && mc->ops->free)
> +		mc->ops->free(mc);
> +	security_mount_ctx_free(mc);
> +	if (mc->mnt_ns)
> +		put_mnt_ns(mc->mnt_ns);
> +	if (mc->pid_ns)
> +		put_pid_ns(mc->pid_ns);
> +	if (mc->net_ns)
> +		put_net(mc->net_ns);
> +	put_user_ns(mc->user_ns);
> +	if (mc->cred)
> +		put_cred(mc->cred);
> +	put_filesystem(mc->fs_type);
> +	kfree(mc->device);
> +	kfree(mc->root_path);
> +	kfree(mc);
> +}
> +EXPORT_SYMBOL(put_mount_context);
> diff --git a/fs/namespace.c b/fs/namespace.c
> index db034b6afd43..e0edab9af308 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -25,6 +25,7 @@
>  #include <linux/magic.h>
>  #include <linux/bootmem.h>
>  #include <linux/task_work.h>
> +#include <linux/file.h>
>  #include <linux/sched/task.h>
>  
>  #include "pnode.h"
> @@ -783,9 +784,14 @@ static void put_mountpoint(struct mountpoint *mp)
>  	}
>  }
>  
> +static inline int __check_mnt(struct mount *mnt, struct mnt_namespace *mnt_ns)
> +{
> +	return mnt->mnt_ns == mnt_ns;
> +}
> +
>  static inline int check_mnt(struct mount *mnt)
>  {
> -	return mnt->mnt_ns == current->nsproxy->mnt_ns;
> +	return __check_mnt(mnt, current->nsproxy->mnt_ns);
>  }
>  
>  /*
> @@ -1596,7 +1602,7 @@ static int do_umount(struct mount *mnt, int flags)
>  			return -EPERM;
>  		down_write(&sb->s_umount);
>  		if (!(sb->s_flags & MS_RDONLY))
> -			retval = do_remount_sb(sb, MS_RDONLY, NULL, 0);
> +			retval = do_remount_sb(sb, MS_RDONLY, NULL, 0, NULL);
>  		up_write(&sb->s_umount);
>  		return retval;
>  	}
> @@ -2279,6 +2285,26 @@ static int change_mount_flags(struct vfsmount *mnt, int ms_flags)
>  }
>  
>  /*
> + * Parse the monolithic page of mount data given to sys_mount().
> + */
> +static int parse_monolithic_mount_data(struct mount_context *mc, void *data)
> +{
> +	int (*monolithic_mount_data)(struct mount_context *, void *);
> +	int ret;
> +
> +	monolithic_mount_data = mc->ops->monolithic_mount_data;
> +	if (!monolithic_mount_data)
> +		monolithic_mount_data = generic_monolithic_mount_data;
> +
> +	ret = monolithic_mount_data(mc, data);
> +	if (ret < 0)
> +		return ret;
> +	if (mc->ops->validate)
> +		return mc->ops->validate(mc);
> +	return 0;
> +}
> +
> +/*
>   * change filesystem flags. dir should be a physical root of filesystem.
>   * If you've mounted a non-root directory somewhere and want to do remount
>   * on it - tough luck.
> @@ -2286,13 +2312,14 @@ static int change_mount_flags(struct vfsmount *mnt, int ms_flags)
>  static int do_remount(struct path *path, int flags, int mnt_flags,
>  		      void *data)
>  {
> +	struct mount_context *mc = NULL;
>  	int err;
>  	struct super_block *sb = path->mnt->mnt_sb;
>  	struct mount *mnt = real_mount(path->mnt);
> +	struct file_system_type *type = sb->s_type;
>  
>  	if (!check_mnt(mnt))
>  		return -EINVAL;
> -
>  	if (path->dentry != path->mnt->mnt_root)
>  		return -EINVAL;
>  
> @@ -2323,9 +2350,19 @@ static int do_remount(struct path *path, int flags, int mnt_flags,
>  		return -EPERM;
>  	}
>  
> -	err = security_sb_remount(sb, data);
> -	if (err)
> -		return err;
> +	if (type->fsopen) {
> +		mc = vfs_mntopen(path->mnt, flags, mnt_flags, MOUNT_TYPE_REMOUNT);
> +		if (IS_ERR(mc))
> +			return PTR_ERR(mc);
> +
> +		err = parse_monolithic_mount_data(mc, data);
> +		if (err < 0)
> +			goto err_mc;
> +	} else {
> +		err = security_sb_remount(sb, data);
> +		if (err)
> +			return err;
> +	}
>  
>  	down_write(&sb->s_umount);
>  	if (flags & MS_BIND)
> @@ -2333,7 +2370,7 @@ static int do_remount(struct path *path, int flags, int mnt_flags,
>  	else if (!capable(CAP_SYS_ADMIN))
>  		err = -EPERM;
>  	else
> -		err = do_remount_sb(sb, flags, data, 0);
> +		err = do_remount_sb(sb, flags, data, 0, mc);
>  	if (!err) {
>  		lock_mount_hash();
>  		mnt_flags |= mnt->mnt.mnt_flags & ~MNT_USER_SETTABLE_MASK;
> @@ -2342,6 +2379,9 @@ static int do_remount(struct path *path, int flags, int mnt_flags,
>  		unlock_mount_hash();
>  	}
>  	up_write(&sb->s_umount);
> +err_mc:
> +	if (mc)
> +		put_mount_context(mc);
>  	return err;
>  }
>  
> @@ -2451,7 +2491,8 @@ static struct vfsmount *fs_set_subtype(struct vfsmount *mnt, const char *fstype)
>  /*
>   * add a mount into a namespace's mount tree
>   */
> -static int do_add_mount(struct mount *newmnt, struct path *path, int mnt_flags)
> +static int do_add_mount(struct mount *newmnt, struct path *path, int mnt_flags,
> +			struct mnt_namespace *mnt_ns)
>  {
>  	struct mountpoint *mp;
>  	struct mount *parent;
> @@ -2465,7 +2506,7 @@ static int do_add_mount(struct mount *newmnt, struct path *path, int mnt_flags)
>  
>  	parent = real_mount(path->mnt);
>  	err = -EINVAL;
> -	if (unlikely(!check_mnt(parent))) {
> +	if (unlikely(!__check_mnt(parent, mnt_ns))) {
>  		/* that's acceptable only for automounts done in private ns */
>  		if (!(mnt_flags & MNT_SHRINKABLE))
>  			goto unlock;
> @@ -2493,42 +2534,73 @@ static int do_add_mount(struct mount *newmnt, struct path *path, int mnt_flags)
>  }
>  
>  static bool mount_too_revealing(struct vfsmount *mnt, int *new_mnt_flags);
> +static int do_new_mount_mc(struct mount_context *mc, struct path *mountpoint,
> +			   unsigned int mnt_flags);
>  
>  /*
>   * create a new mount for userspace and request it to be added into the
>   * namespace's tree
>   */
> -static int do_new_mount(struct path *path, const char *fstype, int flags,
> +static int do_new_mount(struct path *mountpoint, const char *fstype, int flags,
>  			int mnt_flags, const char *name, void *data)
>  {
> -	struct file_system_type *type;
> +	struct mount_context *mc;
>  	struct vfsmount *mnt;
>  	int err;
>  
>  	if (!fstype)
>  		return -EINVAL;
>  
> -	type = get_fs_type(fstype);
> -	if (!type)
> -		return -ENODEV;
> +	mc = vfs_fsopen(fstype);
> +	if (IS_ERR(mc))
> +		return PTR_ERR(mc);
> +	mc->ms_flags = flags;
> +	mc->mnt_flags = mnt_flags;
>  
> -	mnt = vfs_kern_mount(type, flags, name, data);
> -	if (!IS_ERR(mnt) && (type->fs_flags & FS_HAS_SUBTYPE) &&
> -	    !mnt->mnt_sb->s_subtype)
> -		mnt = fs_set_subtype(mnt, fstype);
> +	err = -ENOMEM;
> +	mc->device = kstrdup(name, GFP_KERNEL);
> +	if (!mc->device)
> +		goto err_mc;
>  
> -	put_filesystem(type);
> -	if (IS_ERR(mnt))
> -		return PTR_ERR(mnt);
> +	if (mc->ops) {
> +		err = parse_monolithic_mount_data(mc, data);
> +		if (err < 0)
> +			goto err_mc;
>  
> -	if (mount_too_revealing(mnt, &mnt_flags)) {
> -		mntput(mnt);
> -		return -EPERM;
> +		err = do_new_mount_mc(mc, mountpoint, mnt_flags);
> +		if (err)
> +			goto err_mc;
> +
> +	} else {
> +		mnt = vfs_kern_mount(mc->fs_type, flags, name, data);
> +		if (!IS_ERR(mnt) && (mc->fs_type->fs_flags & FS_HAS_SUBTYPE) &&
> +		    !mnt->mnt_sb->s_subtype)
> +			mnt = fs_set_subtype(mnt, fstype);
> +
> +		if (IS_ERR(mnt)) {
> +			err = PTR_ERR(mnt);
> +			goto err_mc;
> +		}
> +
> +		err = -EPERM;
> +		if (mount_too_revealing(mnt, &mnt_flags))
> +			goto err_mnt;
> +
> +		err = do_add_mount(real_mount(mnt), mountpoint, mnt_flags,
> +				   mc->mnt_ns);
> +		if (err)
> +			goto err_mnt;
>  	}
>  
> -	err = do_add_mount(real_mount(mnt), path, mnt_flags);
> -	if (err)
> -		mntput(mnt);
> +	put_mount_context(mc);
> +	return 0;
> +
> +err_mnt:
> +	mntput(mnt);
> +err_mc:
> +	if (mc->error)
> +		pr_info("Mount failed: %s\n", mc->error);
> +	put_mount_context(mc);
>  	return err;
>  }
>  
> @@ -2547,7 +2619,8 @@ int finish_automount(struct vfsmount *m, struct path *path)
>  		goto fail;
>  	}
>  
> -	err = do_add_mount(mnt, path, path->mnt->mnt_flags | MNT_SHRINKABLE);
> +	err = do_add_mount(mnt, path, path->mnt->mnt_flags | MNT_SHRINKABLE,
> +			   current->nsproxy->mnt_ns);
>  	if (!err)
>  		return 0;
>  fail:
> @@ -3061,6 +3134,130 @@ SYSCALL_DEFINE5(mount, char __user *, dev_name, char __user *, dir_name,
>  	return ret;
>  }
>  
> +static struct dentry *__do_mount_mc(struct mount_context *mc)
> +{
> +	struct super_block *sb;
> +	struct dentry *root;
> +	int ret;
> +
> +	root = mc->ops->mount(mc);
> +	if (IS_ERR(root))
> +		return root;
> +
> +	sb = root->d_sb;
> +	BUG_ON(!sb);
> +	WARN_ON(!sb->s_bdi);
> +	sb->s_flags |= MS_BORN;
> +
> +	ret = security_mount_ctx_kern_mount(mc, sb);
> +	if (ret < 0)
> +		goto err_sb;
> +
> +	/*
> +	 * filesystems should never set s_maxbytes larger than MAX_LFS_FILESIZE
> +	 * but s_maxbytes was an unsigned long long for many releases. Throw
> +	 * this warning for a little while to try and catch filesystems that
> +	 * violate this rule.
> +	 */
> +	WARN((sb->s_maxbytes < 0), "%s set sb->s_maxbytes to "
> +		"negative value (%lld)\n", mc->fs_type->name, sb->s_maxbytes);
> +
> +	up_write(&sb->s_umount);
> +	return root;
> +
> +err_sb:
> +	dput(root);
> +	deactivate_locked_super(sb);
> +	return ERR_PTR(ret);
> +}
> +
> +struct vfsmount *vfs_kern_mount_mc(struct mount_context *mc)
> +{
> +	struct dentry *root;
> +	struct mount *mnt;
> +	int ret;
> +
> +	if (mc->ops->validate) {
> +		ret = mc->ops->validate(mc);
> +		if (ret < 0)
> +			return ERR_PTR(ret);
> +	}
> +
> +	mnt = alloc_vfsmnt(mc->device ?: "none");
> +	if (!mnt)
> +		return ERR_PTR(-ENOMEM);
> +
> +	if (mc->ms_flags & MS_KERNMOUNT)
> +		mnt->mnt.mnt_flags = MNT_INTERNAL;
> +
> +	root = __do_mount_mc(mc);
> +	if (IS_ERR(root)) {
> +		mnt_free_id(mnt);
> +		free_vfsmnt(mnt);
> +		return ERR_CAST(root);
> +	}
> +
> +	mnt->mnt.mnt_root	= root;
> +	mnt->mnt.mnt_sb		= root->d_sb;
> +	mnt->mnt_mountpoint	= mnt->mnt.mnt_root;
> +	mnt->mnt_parent		= mnt;
> +	lock_mount_hash();
> +	list_add_tail(&mnt->mnt_instance, &root->d_sb->s_mounts);
> +	unlock_mount_hash();
> +	return &mnt->mnt;
> +}
> +EXPORT_SYMBOL_GPL(vfs_kern_mount_mc);
> +
> +struct vfsmount *
> +vfs_submount_mc(const struct dentry *mountpoint, struct mount_context *mc)
> +{
> +	/* Until it is worked out how to pass the user namespace
> +	 * through from the parent mount to the submount don't support
> +	 * unprivileged mounts with submounts.
> +	 */
> +	if (mountpoint->d_sb->s_user_ns != &init_user_ns)
> +		return ERR_PTR(-EPERM);
> +
> +	mc->ms_flags = MS_SUBMOUNT;
> +	return vfs_kern_mount_mc(mc);
> +}
> +EXPORT_SYMBOL_GPL(vfs_submount_mc);
> +
> +static int do_new_mount_mc(struct mount_context *mc, struct path *mountpoint,
> +			   unsigned int mnt_flags)
> +{
> +	struct vfsmount *mnt;
> +	int ret;
> +
> +	mnt = vfs_kern_mount_mc(mc);
> +	if (IS_ERR(mnt))
> +		return PTR_ERR(mnt);
> +
> +	if ((mc->fs_type->fs_flags & FS_HAS_SUBTYPE) &&
> +	    !mnt->mnt_sb->s_subtype) {
> +		mnt = fs_set_subtype(mnt, mc->fs_type->name);
> +		if (IS_ERR(mnt))
> +			return PTR_ERR(mnt);
> +	}
> +
> +	ret = -EPERM;
> +	if (mount_too_revealing(mnt, &mnt_flags)) {
> +		mc->error = "VFS: Mount too revealing";
> +		goto err_mnt;
> +	}
> +
> +	ret = do_add_mount(real_mount(mnt), mountpoint, mnt_flags, mc->mnt_ns);
> +	if (ret < 0) {
> +		mc->error = "VFS: Failed to add mount";
> +		goto err_mnt;
> +	}
> +	return ret;
> +
> +err_mnt:
> +	mntput(mnt);
> +	return ret;
> +}
> +
>  /*
>   * Return true if path is reachable from root
>   *
> @@ -3302,6 +3499,23 @@ struct vfsmount *kern_mount_data(struct file_system_type *type, void *data)
>  }
>  EXPORT_SYMBOL_GPL(kern_mount_data);
>  
> +struct vfsmount *kern_mount_data_mc(struct mount_context *mc)
> +{
> +	struct vfsmount *mnt;
> +
> +	mc->ms_flags = MS_KERNMOUNT;
> +	mnt = vfs_kern_mount_mc(mc);
> +	if (!IS_ERR(mnt)) {
> +		/*
> +		 * it is a longterm mount, don't release mnt until
> +		 * we unmount before file sys is unregistered
> +		 */
> +		real_mount(mnt)->mnt_ns = MNT_NS_INTERNAL;
> +	}
> +	return mnt;
> +}
> +EXPORT_SYMBOL_GPL(kern_mount_data_mc);
> +
>  void kern_unmount(struct vfsmount *mnt)
>  {
>  	/* release long term mount so mount point can be released */
> diff --git a/fs/super.c b/fs/super.c
> index adb0c0de428c..6e7b86520337 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -805,10 +805,13 @@ struct super_block *user_get_super(dev_t dev)
>   *	@flags:	numeric part of options
>   *	@data:	the rest of options
>   *      @force: whether or not to force the change
> + *	@mc:	the mount context for filesystems that support it
> + *		(NULL if called from emergency or umount)
>   *
>   *	Alters the mount options of a mounted file system.
>   */
> -int do_remount_sb(struct super_block *sb, int flags, void *data, int force)
> +int do_remount_sb(struct super_block *sb, int flags, void *data, int force,
> +		  struct mount_context *mc)
>  {
>  	int retval;
>  	int remount_ro;
> @@ -850,8 +853,14 @@ int do_remount_sb(struct super_block *sb, int flags, void *data, int force)
>  		}
>  	}
>  
> -	if (sb->s_op->remount_fs) {
> -		retval = sb->s_op->remount_fs(sb, &flags, data);
> +	if (sb->s_op->remount_fs_mc ||
> +	    sb->s_op->remount_fs) {
> +		if (sb->s_op->remount_fs_mc) {
> +			retval = sb->s_op->remount_fs_mc(sb, mc);
> +			flags = mc->ms_flags;
> +		} else {
> +			retval = sb->s_op->remount_fs(sb, &flags, data);
> +		}
>  		if (retval) {
>  			if (!force)
>  				goto cancel_readonly;
> @@ -898,7 +907,7 @@ static void do_emergency_remount(struct work_struct *work)
>  			/*
>  			 * What lock protects sb->s_flags??
>  			 */
> -			do_remount_sb(sb, MS_RDONLY, NULL, 1);
> +			do_remount_sb(sb, MS_RDONLY, NULL, 1, NULL);
>  		}
>  		up_write(&sb->s_umount);
>  		spin_lock(&sb_lock);
> @@ -1048,6 +1057,37 @@ struct dentry *mount_ns(struct file_system_type *fs_type,
>  
>  EXPORT_SYMBOL(mount_ns);
>  
> +struct dentry *mount_ns_mc(struct mount_context *mc, void *ns)
> +{
> +	struct super_block *sb;
> +
> +	/* Don't allow mounting unless the caller has CAP_SYS_ADMIN
> +	 * over the namespace.
> +	 */
> +	if (!(mc->ms_flags & MS_KERNMOUNT) &&
> +	    !ns_capable(mc->user_ns, CAP_SYS_ADMIN))
> +		return ERR_PTR(-EPERM);
> +
> +	sb = sget_userns(mc->fs_type, ns_test_super, ns_set_super,
> +			 mc->ms_flags, mc->user_ns, ns);
> +	if (IS_ERR(sb))
> +		return ERR_CAST(sb);
> +
> +	if (!sb->s_root) {
> +		int err;
> +		err = mc->ops->fill_super(sb, mc);
> +		if (err) {
> +			deactivate_locked_super(sb);
> +			return ERR_PTR(err);
> +		}
> +
> +		sb->s_flags |= MS_ACTIVE;
> +	}
> +
> +	return dget(sb->s_root);
> +}
> +EXPORT_SYMBOL(mount_ns_mc);
> +
>  #ifdef CONFIG_BLOCK
>  static int set_bdev_super(struct super_block *s, void *data)
>  {
> @@ -1196,7 +1236,7 @@ struct dentry *mount_single(struct file_system_type *fs_type,
>  		}
>  		s->s_flags |= MS_ACTIVE;
>  	} else {
> -		do_remount_sb(s, flags, data, 0);
> +		do_remount_sb(s, flags, data, 0, NULL);
>  	}
>  	return dget(s->s_root);
>  }
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 30e5c14bd743..40fe5c5054ec 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -55,6 +55,7 @@ struct workqueue_struct;
>  struct iov_iter;
>  struct fscrypt_info;
>  struct fscrypt_operations;
> +struct mount_context;
>  
>  extern void __init inode_init(void);
>  extern void __init inode_init_early(void);
> @@ -701,6 +702,11 @@ static inline void inode_unlock(struct inode *inode)
>  	up_write(&inode->i_rwsem);
>  }
>  
> +static inline int inode_lock_killable(struct inode *inode)
> +{
> +	return down_write_killable(&inode->i_rwsem);
> +}
> +
>  static inline void inode_lock_shared(struct inode *inode)
>  {
>  	down_read(&inode->i_rwsem);
> @@ -1786,6 +1792,7 @@ struct super_operations {
>  	int (*unfreeze_fs) (struct super_block *);
>  	int (*statfs) (struct dentry *, struct kstatfs *);
>  	int (*remount_fs) (struct super_block *, int *, char *);
> +	int (*remount_fs_mc) (struct super_block *, struct mount_context *);
>  	void (*umount_begin) (struct super_block *);
>  
>  	int (*show_options)(struct seq_file *, struct dentry *);
> @@ -2020,8 +2027,10 @@ struct file_system_type {
>  #define FS_HAS_SUBTYPE		4
>  #define FS_USERNS_MOUNT		8	/* Can be mounted by userns root */
>  #define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move() during rename() internally. */
> +	unsigned short mc_size;		/* Size of mount context to allocate */
>  	struct dentry *(*mount) (struct file_system_type *, int,
>  		       const char *, void *);
> +	int (*fsopen)(struct mount_context *, struct super_block *);
>  	void (*kill_sb) (struct super_block *);
>  	struct module *owner;
>  	struct file_system_type * next;
> @@ -2039,6 +2048,7 @@ struct file_system_type {
>  
>  #define MODULE_ALIAS_FS(NAME) MODULE_ALIAS("fs-" NAME)
>  
> +extern struct dentry *mount_ns_mc(struct mount_context *mc, void *ns);
>  extern struct dentry *mount_ns(struct file_system_type *fs_type,
>  	int flags, void *data, void *ns, struct user_namespace *user_ns,
>  	int (*fill_super)(struct super_block *, void *, int));
> @@ -2105,6 +2115,7 @@ extern int register_filesystem(struct file_system_type *);
>  extern int unregister_filesystem(struct file_system_type *);
>  extern struct vfsmount *kern_mount_data(struct file_system_type *, void *data);
>  #define kern_mount(type) kern_mount_data(type, NULL)
> +extern struct vfsmount *kern_mount_data_mc(struct mount_context *);
>  extern void kern_unmount(struct vfsmount *mnt);
>  extern int may_umount_tree(struct vfsmount *);
>  extern int may_umount(struct vfsmount *);
> diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
> index e29d4c62a3c8..f6aa68b8e68e 100644
> --- a/include/linux/lsm_hooks.h
> +++ b/include/linux/lsm_hooks.h
> @@ -75,6 +75,32 @@
>   *	should enable secure mode.
>   *	@bprm contains the linux_binprm structure.
>   *
> + * Security hooks for mount using fd context.
> + *
> + * @mount_ctx_alloc:
> + *	Allocate and attach a security structure to mc->security.  This pointer
> + *	is initialised to NULL by the caller.
> + *	@mc indicates the new mount context.
> + *	@src_sb indicates the source superblock of a submount.
> + * @mount_ctx_dup:
> + *	Allocate and attach a security structure to mc->security.  This pointer
> + *	is initialised to NULL by the caller.
> + *	@mc indicates the new mount context.
> + *	@src_mc indicates the original mount context.
> + * @mount_ctx_free:
> + *	Clean up a mount context.
> + *	@mc indicates the mount context.
> + * @mount_ctx_option:
> + *	Userspace provided an option to configure a mount.  The LSM may reject
> + *	it with an error and may use it for itself, in which case it should
> + *	return 1; otherwise it should return 0 to pass it on to the filesystem.
> + *	@mc indicates the mount context.
> + *	@opt indicates the option in "key[=val]" form.
> + * @mount_ctx_kern_mount:
> + *	Equivalent of sb_kern_mount, but with a mount_context.
> + *	@mc indicates the mount context.
> + *	@sb indicates the new superblock.
> + *
>   * Security hooks for filesystem operations.
>   *
>   * @sb_alloc_security:
> @@ -1358,6 +1384,12 @@ union security_list_options {
>  	void (*bprm_committing_creds)(struct linux_binprm *bprm);
>  	void (*bprm_committed_creds)(struct linux_binprm *bprm);
>  
> +	int (*mount_ctx_alloc)(struct mount_context *mc, struct super_block *src_sb);
> +	int (*mount_ctx_dup)(struct mount_context *mc, struct mount_context *src_mc);
> +	void (*mount_ctx_free)(struct mount_context *mc);
> +	int (*mount_ctx_option)(struct mount_context *mc, char *opt);
> +	int (*mount_ctx_kern_mount)(struct mount_context *mc, struct super_block *sb);
> +
>  	int (*sb_alloc_security)(struct super_block *sb);
>  	void (*sb_free_security)(struct super_block *sb);
>  	int (*sb_copy_data)(char *orig, char *copy);
> @@ -1666,6 +1698,11 @@ struct security_hook_heads {
>  	struct list_head bprm_secureexec;
>  	struct list_head bprm_committing_creds;
>  	struct list_head bprm_committed_creds;
> +	struct list_head mount_ctx_alloc;
> +	struct list_head mount_ctx_dup;
> +	struct list_head mount_ctx_free;
> +	struct list_head mount_ctx_option;
> +	struct list_head mount_ctx_kern_mount;
>  	struct list_head sb_alloc_security;
>  	struct list_head sb_free_security;
>  	struct list_head sb_copy_data;
> diff --git a/include/linux/mount.h b/include/linux/mount.h
> index 8e0352af06b7..cf2583406986 100644
> --- a/include/linux/mount.h
> +++ b/include/linux/mount.h
> @@ -69,6 +69,56 @@ struct vfsmount {
>  	int mnt_flags;
>  };
>  
> +struct mount_context;
> +struct mount_context_operations {
> +	void (*free)(struct mount_context *mc);
> +	int (*dup)(struct mount_context *mc, struct mount_context *src);
> +	/* An option has been specified. */
> +	int (*option)(struct mount_context *mc, char *p);
> +	/* Parse monolithic mount data. */
> +	int (*monolithic_mount_data)(struct mount_context *mc, void *data);
> +	/* Validate the mount options */
> +	int (*validate)(struct mount_context *mc);
> +	/* Perform the mount. */
> +	struct dentry *(*mount)(struct mount_context *mc);
> +	/* Fill in a superblock */
> +	int (*fill_super)(struct super_block *s, struct mount_context *mc);
> +};
> +
> +enum mount_type {
> +	MOUNT_TYPE_NEW,		/* New mount made directly */
> +	MOUNT_TYPE_SUBMOUNT,	/* New mount made automatically */
> +	MOUNT_TYPE_REMOUNT,	/* Change of an existing mount */
> +};
> +
> +/*
> + * Mount context as allocated and constructed by fsopen().  The filesystem must
> + * support the ->ctx_*() operations.  The size of the object allocated is in
> + * struct file_system_type::mc_size; this struct must be embedded as the
> + * first thing in the filesystem's own context.
> + */
> +struct mount_context {
> +	const struct mount_context_operations *ops;
> +	struct file_system_type	*fs_type;
> +	struct user_namespace	*user_ns;	/* The user namespace for this mount */
> +	struct mnt_namespace	*mnt_ns;	/* The mount namespace for this mount */
> +	struct pid_namespace	*pid_ns;	/* The process ID namespace for this mount */
> +	struct net		*net_ns;	/* The network namespace for this mount */
> +	const struct cred	*cred;		/* The mounter's credentials */
> +	char			*device;	/* The device name or mount target */
> +	char			*root_path;	/* The path within the mount to mount */
> +	void			*security;	/* The LSM context */
> +	const char		*error;		/* Error string to be read by read() */
> +	unsigned int		ms_flags;	/* The superblock flags (MS_*) */
> +	unsigned int		mnt_flags;	/* The mount flags (MNT_*) */
> +	bool			mounted;	/* Set when mounted */
> +	bool			sloppy;		/* Unrecognised options are okay */
> +	bool			silent;
> +	enum mount_type		mount_type : 8;
> +};
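
The "embedded as the first thing" requirement is the usual first-member pattern.  A rough userspace sketch follows, with hypothetical types (struct base_context, struct myfs_context) and helpers (alloc_context(), to_myfs()), and with calloc() standing in for kzalloc():

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the generic struct mount_context. */
struct base_context {
	const char *fs_name;
	unsigned int flags;
};

/* A filesystem's private context; the generic part must come first. */
struct myfs_context {
	struct base_context base;
	unsigned int uid;
	int verbose;
};

_Static_assert(offsetof(struct myfs_context, base) == 0,
	       "generic context must be the first member");

/*
 * Allocate a zeroed context of whichever is larger: the size the filesystem
 * asked for or the size of the generic part.  Returns NULL on failure.
 */
static struct base_context *alloc_context(size_t ctx_size, const char *fs_name)
{
	struct base_context *bc;
	size_t size = ctx_size > sizeof(*bc) ? ctx_size : sizeof(*bc);

	bc = calloc(1, size);
	if (!bc)
		return NULL;
	bc->fs_name = fs_name;
	return bc;
}

/* Valid only because base is at offset 0 and enough space was allocated. */
static struct myfs_context *to_myfs(struct base_context *bc)
{
	assert(bc != NULL);
	return (struct myfs_context *)bc;
}
```

The generic code only ever sees the base pointer and the size, so a filesystem that advertises too small a size would have its private fields land outside the allocation.
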
> +
> +extern const struct file_operations fs_fs_fops;
> +
>  struct file; /* forward dec */
>  struct path;
>  
> @@ -90,9 +140,26 @@ struct file_system_type;
>  extern struct vfsmount *vfs_kern_mount(struct file_system_type *type,
>  				      int flags, const char *name,
>  				      void *data);
> +extern struct vfsmount *vfs_kern_mount_mc(struct mount_context *mc);
>  extern struct vfsmount *vfs_submount(const struct dentry *mountpoint,
>  				     struct file_system_type *type,
>  				     const char *name, void *data);
> +extern struct vfsmount *vfs_submount_mc(const struct dentry *mountpoint,
> +					struct mount_context *mc);
> +extern struct mount_context *vfs_fsopen(const char *fs_name);
> +extern struct mount_context *__vfs_fsopen(struct file_system_type *fs_type,
> +					  struct super_block *src_sb,
> +					  unsigned int ms_flags,
> +					  unsigned int mnt_flags,
> +					  enum mount_type mount_type);
> +extern struct mount_context *vfs_mntopen(struct vfsmount *mnt,
> +					 unsigned int ms_flags,
> +					 unsigned int mnt_flags,
> +					 enum mount_type mount_type);
> +extern struct mount_context *vfs_dup_mount_context(struct mount_context *src);
> +extern int vfs_mount_option(struct mount_context *mc, char *data);
> +extern int generic_monolithic_mount_data(struct mount_context *ctx, void *data);
> +extern void put_mount_context(struct mount_context *ctx);
>  
>  extern void mnt_set_expiry(struct vfsmount *mnt, struct list_head *expiry_list);
>  extern void mark_mounts_for_expiry(struct list_head *mounts);
> diff --git a/include/linux/security.h b/include/linux/security.h
> index 96899fad7016..91efe3039bff 100644
> --- a/include/linux/security.h
> +++ b/include/linux/security.h
> @@ -55,6 +55,7 @@ struct msg_queue;
>  struct xattr;
>  struct xfrm_sec_ctx;
>  struct mm_struct;
> +struct mount_context;
>  
>  /* If capable should audit the security request */
>  #define SECURITY_CAP_NOAUDIT 0
> @@ -220,6 +221,11 @@ int security_bprm_check(struct linux_binprm *bprm);
>  void security_bprm_committing_creds(struct linux_binprm *bprm);
>  void security_bprm_committed_creds(struct linux_binprm *bprm);
>  int security_bprm_secureexec(struct linux_binprm *bprm);
> +int security_mount_ctx_alloc(struct mount_context *mc, struct super_block *sb);
> +int security_mount_ctx_dup(struct mount_context *mc, struct mount_context *src);
> +void security_mount_ctx_free(struct mount_context *mc);
> +int security_mount_ctx_option(struct mount_context *mc, char *opt);
> +int security_mount_ctx_kern_mount(struct mount_context *mc, struct super_block *sb);
>  int security_sb_alloc(struct super_block *sb);
>  void security_sb_free(struct super_block *sb);
>  int security_sb_copy_data(char *orig, char *copy);
> @@ -513,6 +519,29 @@ static inline int security_bprm_secureexec(struct linux_binprm *bprm)
>  	return cap_bprm_secureexec(bprm);
>  }
>  
> +static inline int security_mount_ctx_alloc(struct mount_context *mc,
> +					   struct super_block *src_sb)
> +{
> +	return 0;
> +}
> +static inline int security_mount_ctx_dup(struct mount_context *mc,
> +					 struct mount_context *src)
> +{
> +	return 0;
> +}
> +static inline void security_mount_ctx_free(struct mount_context *mc)
> +{
> +}
> +static inline int security_mount_ctx_option(struct mount_context *mc, char *opt)
> +{
> +	return 0;
> +}
> +static inline int security_mount_ctx_kern_mount(struct mount_context *mc,
> +						struct super_block *sb)
> +{
> +	return 0;
> +}
> +
>  static inline int security_sb_alloc(struct super_block *sb)
>  {
>  	return 0;
> diff --git a/security/security.c b/security/security.c
> index 23555c5504f6..2e522361df66 100644
> --- a/security/security.c
> +++ b/security/security.c
> @@ -309,6 +309,31 @@ int security_bprm_secureexec(struct linux_binprm *bprm)
>  	return call_int_hook(bprm_secureexec, 0, bprm);
>  }
>  
> +int security_mount_ctx_alloc(struct mount_context *mc, struct super_block *src_sb)
> +{
> +	return call_int_hook(mount_ctx_alloc, 0, mc, src_sb);
> +}
> +
> +int security_mount_ctx_dup(struct mount_context *mc, struct mount_context *src_mc)
> +{
> +	return call_int_hook(mount_ctx_dup, 0, mc, src_mc);
> +}
> +
> +void security_mount_ctx_free(struct mount_context *mc)
> +{
> +	call_void_hook(mount_ctx_free, mc);
> +}
> +
> +int security_mount_ctx_option(struct mount_context *mc, char *opt)
> +{
> +	return call_int_hook(mount_ctx_option, 0, mc, opt);
> +}
> +
> +int security_mount_ctx_kern_mount(struct mount_context *mc, struct super_block *sb)
> +{
> +	return call_int_hook(mount_ctx_kern_mount, 0, mc, sb);
> +}
> +
>  int security_sb_alloc(struct super_block *sb)
>  {
>  	return call_int_hook(sb_alloc_security, 0, sb);
> @@ -1659,6 +1684,13 @@ struct security_hook_heads security_hook_heads = {
>  		LIST_HEAD_INIT(security_hook_heads.bprm_committing_creds),
>  	.bprm_committed_creds =
>  		LIST_HEAD_INIT(security_hook_heads.bprm_committed_creds),
> +	.mount_ctx_alloc = LIST_HEAD_INIT(security_hook_heads.mount_ctx_alloc),
> +	.mount_ctx_dup = LIST_HEAD_INIT(security_hook_heads.mount_ctx_dup),
> +	.mount_ctx_free = LIST_HEAD_INIT(security_hook_heads.mount_ctx_free),
> +	.mount_ctx_option =
> +		LIST_HEAD_INIT(security_hook_heads.mount_ctx_option),
> +	.mount_ctx_kern_mount =
> +		LIST_HEAD_INIT(security_hook_heads.mount_ctx_kern_mount),
>  	.sb_alloc_security =
>  		LIST_HEAD_INIT(security_hook_heads.sb_alloc_security),
>  	.sb_free_security =
> diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
> index 0c2ac318aa7f..cf38db840f71 100644
> --- a/security/selinux/hooks.c
> +++ b/security/selinux/hooks.c
> @@ -2826,6 +2826,179 @@ static int selinux_umount(struct vfsmount *mnt, int flags)
>  				   FILESYSTEM__UNMOUNT, NULL);
>  }
>  
> +/* fsopen mount context operations */
> +
> +static int selinux_mount_ctx_alloc(struct mount_context *mc,
> +				   struct super_block *src_sb)
> +{
> +	struct security_mnt_opts *opts;
> +
> +	opts = kzalloc(sizeof(*opts), GFP_KERNEL);
> +	if (!opts)
> +		return -ENOMEM;
> +
> +	mc->security = opts;
> +	return 0;
> +}
> +
> +static int selinux_mount_ctx_dup(struct mount_context *mc,
> +				 struct mount_context *src_mc)
> +{
> +	const struct security_mnt_opts *src = src_mc->security;
> +	struct security_mnt_opts *opts;
> +	int i, n;
> +
> +	opts = kzalloc(sizeof(*opts), GFP_KERNEL);
> +	if (!opts)
> +		return -ENOMEM;
> +	mc->security = opts;
> +
> +	if (!src || !src->num_mnt_opts)
> +		return 0;
> +	n = opts->num_mnt_opts = src->num_mnt_opts;
> +
> +	if (src->mnt_opts) {
> +		opts->mnt_opts = kcalloc(n, sizeof(char *), GFP_KERNEL);
> +		if (!opts->mnt_opts)
> +			return -ENOMEM;
> +
> +		for (i = 0; i < n; i++) {
> +			if (src->mnt_opts[i]) {
> +				opts->mnt_opts[i] = kstrdup(src->mnt_opts[i],
> +							    GFP_KERNEL);
> +				if (!opts->mnt_opts[i])
> +					return -ENOMEM;
> +			}
> +		}
> +	}
> +
> +	if (src->mnt_opts_flags) {
> +		opts->mnt_opts_flags = kmemdup(src->mnt_opts_flags,
> +					       n * sizeof(int), GFP_KERNEL);
> +		if (!opts->mnt_opts_flags)
> +			return -ENOMEM;
> +	}
> +
> +	return 0;
> +}
> +
> +static void selinux_mount_ctx_free(struct mount_context *mc)
> +{
> +	struct security_mnt_opts *opts = mc->security;
> +
> +	security_free_mnt_opts(opts);
> +	mc->security = NULL;
> +}
> +
> +static int selinux_mount_ctx_option(struct mount_context *mc, char *opt)
> +{
> +	struct security_mnt_opts *opts = mc->security;
> +	substring_t args[MAX_OPT_ARGS];
> +	unsigned int have;
> +	char *c, **oo;
> +	void *old;
> +	int token, ctx, i;
> +
> +	token = match_token(opt, tokens, args);
> +	if (token == Opt_error)
> +		return 0; /* Doesn't belong to us. */
> +
> +	have = 0;
> +	for (i = 0; i < opts->num_mnt_opts; i++)
> +		have |= 1 << opts->mnt_opts_flags[i];
> +	if (have & (1 << token)) {
> +		mc->error = "SELinux: Duplicate mount options";
> +		return -EINVAL;
> +	}
> +
> +	switch (token) {
> +	case Opt_context:
> +		if (have & (1 << Opt_defcontext))
> +			goto incompatible;
> +		ctx = CONTEXT_MNT;
> +		goto copy_context_string;
> +
> +	case Opt_fscontext:
> +		ctx = FSCONTEXT_MNT;
> +		goto copy_context_string;
> +
> +	case Opt_rootcontext:
> +		ctx = ROOTCONTEXT_MNT;
> +		goto copy_context_string;
> +
> +	case Opt_defcontext:
> +		if (have & (1 << Opt_context))
> +			goto incompatible;
> +		ctx = DEFCONTEXT_MNT;
> +		goto copy_context_string;
> +
> +	case Opt_labelsupport:
> +		return 1;
> +
> +	default:
> +		mc->error = "SELinux: Unknown mount option";
> +		return -EINVAL;
> +	}
> +
> +copy_context_string:
> +	if (opts->num_mnt_opts > 3) {
> +		mc->error = "SELinux: Too many options";
> +		return -EINVAL;
> +	}
> +	if (!opts->mnt_opts_flags) {
> +		opts->mnt_opts_flags = kcalloc(3, sizeof(int), GFP_KERNEL);
> +		if (!opts->mnt_opts_flags)
> +			return -ENOMEM;
> +	}
> +
> +	if (opts->mnt_opts) {
> +		oo = kmalloc((opts->num_mnt_opts + 1) * sizeof(char *),
> +			     GFP_KERNEL);
> +		if (!oo)
> +			return -ENOMEM;
> +		memcpy(oo, opts->mnt_opts, opts->num_mnt_opts * sizeof(char *));
> +		oo[opts->num_mnt_opts] = NULL;
> +		old = opts->mnt_opts;
> +		opts->mnt_opts = oo;
> +		kfree(old);
> +	}
> +
> +	c = match_strdup(&args[0]);
> +	if (!c)
> +		return -ENOMEM;
> +	opts->mnt_opts[opts->num_mnt_opts] = c;
> +	opts->mnt_opts_flags[opts->num_mnt_opts] = ctx;
> +	opts->num_mnt_opts++;
> +	return 1;
> +
> +incompatible:
> +	mc->error = "SELinux: Incompatible mount options";
> +	return -EINVAL;
> +}
> +
> +static int selinux_mount_ctx_kern_mount(struct mount_context *mc,
> +					struct super_block *sb)
> +{
> +	const struct cred *cred = current_cred();
> +	struct common_audit_data ad;
> +	int rc;
> +
> +	rc = selinux_set_mnt_opts(sb, mc->security, 0, NULL);
> +	if (rc)
> +		return rc;
> +
> +	/* Allow all mounts performed by the kernel */
> +	if (mc->ms_flags & MS_KERNMOUNT)
> +		return 0;
> +
> +	ad.type = LSM_AUDIT_DATA_DENTRY;
> +	ad.u.dentry = sb->s_root;
> +	rc = superblock_has_perm(cred, sb, FILESYSTEM__MOUNT, &ad);
> +	if (rc < 0)
> +		mc->error = "SELinux: Mount of superblock not permitted";
> +	return rc;
> +}
> +
>  /* inode security operations */
>  
>  static int selinux_inode_alloc_security(struct inode *inode)
> @@ -6131,6 +6304,12 @@ static struct security_hook_list selinux_hooks[] = {
>  	LSM_HOOK_INIT(bprm_committed_creds, selinux_bprm_committed_creds),
>  	LSM_HOOK_INIT(bprm_secureexec, selinux_bprm_secureexec),
>  
> +	LSM_HOOK_INIT(mount_ctx_alloc, selinux_mount_ctx_alloc),
> +	LSM_HOOK_INIT(mount_ctx_dup, selinux_mount_ctx_dup),
> +	LSM_HOOK_INIT(mount_ctx_free, selinux_mount_ctx_free),
> +	LSM_HOOK_INIT(mount_ctx_option, selinux_mount_ctx_option),
> +	LSM_HOOK_INIT(mount_ctx_kern_mount, selinux_mount_ctx_kern_mount),
> +
>  	LSM_HOOK_INIT(sb_alloc_security, selinux_sb_alloc_security),
>  	LSM_HOOK_INIT(sb_free_security, selinux_sb_free_security),
>  	LSM_HOOK_INIT(sb_copy_data, selinux_sb_copy_data),
> 

Whew, big patch. It all looks fairly sane at first glance though and
it's well documented AFAICT. It would be nice if this were in more
easily digestible chunks, but I don't see how to break it up right
offhand.
-- 
Jeff Layton <jlayton@poochiereds.net>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 3/9] VFS: Introduce a mount context
  2017-05-03 18:13   ` Jeff Layton
@ 2017-05-03 18:26     ` Joe Perches
  2017-05-03 20:38       ` Matthew Wilcox
  2017-05-03 21:17       ` David Howells
  2017-05-03 18:37     ` David Howells
  2017-05-04  9:27     ` David Howells
  2 siblings, 2 replies; 66+ messages in thread
From: Joe Perches @ 2017-05-03 18:26 UTC (permalink / raw)
  To: Jeff Layton, David Howells, viro
  Cc: linux-fsdevel, linux-nfs, linux-kernel, mszeredi

On Wed, 2017-05-03 at 14:13 -0400, Jeff Layton wrote:
> On Wed, 2017-05-03 at 17:04 +0100, David Howells wrote:
> > Introduce a mount context concept.

trivia:

> > static int selinux_mount_ctx_option(struct mount_context *mc, char *opt)
> > +{
[]
> > +	if (opts->mnt_opts) {
> > +		oo = kmalloc((opts->num_mnt_opts + 1) * sizeof(char *),
> > +			     GFP_KERNEL);
> > +		if (!oo)
> > +			return -ENOMEM;
> > +		memcpy(oo, opts->mnt_opts, opts->num_mnt_opts * sizeof(char *));
> > +		oo[opts->num_mnt_opts] = NULL;
> > +		old = opts->mnt_opts;
> > +		opts->mnt_opts = oo;
> > +		kfree(old);
> > +	}

krealloc would probably be more efficient and possibly more
readable, as there is likely already padding in the original
allocation.

Are there no locking constraints?

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 4/9] Implement fsopen() to prepare for a mount
  2017-05-03 16:05 ` [PATCH 4/9] Implement fsopen() to prepare for a mount David Howells
@ 2017-05-03 18:37   ` Jeff Layton
  2017-05-03 18:41   ` David Howells
                     ` (7 subsequent siblings)
  8 siblings, 0 replies; 66+ messages in thread
From: Jeff Layton @ 2017-05-03 18:37 UTC (permalink / raw)
  To: David Howells, viro; +Cc: linux-fsdevel, linux-nfs, linux-kernel, mszeredi

On Wed, 2017-05-03 at 17:05 +0100, David Howells wrote:
> Provide an fsopen() system call that starts the process of preparing to
> mount, using an fd as a context handle.  fsopen() is given the name of the
> filesystem that will be used:
> 
> 	int mfd = fsopen(const char *fsname, int reserved,
> 			 int open_flags);
> 
> where reserved should be -1 for the moment (it will be used to pass the
> namespace information in future) and open_flags can be 0 or O_CLOEXEC.
> 
> For example:
> 
> 	mfd = fsopen("ext4", -1, O_CLOEXEC);
> 	write(mfd, "d /dev/sdb1"); // note I'm ignoring write's length arg
> 	write(mfd, "o noatime");
> 	write(mfd, "o acl");
> 	write(mfd, "o user_attr");
> 	write(mfd, "o iversion");
> 	write(mfd, "o ");
> 	write(mfd, "r /my/container"); // root inside the fs
> 	fsmount(mfd, container_fd, "/mnt", AT_NO_FOLLOW);
> 
> 	mfd = fsopen("afs", -1, 0);
> 	write(mfd, "d %grand.central.org:root.cell");
> 	write(mfd, "o cell=grand.central.org");
> 	write(mfd, "r /");
> 	fsmount(mfd, AT_FDCWD, "/mnt", 0);
> 

I think one of the neat things here is that we can now error out as
soon as a bogus mount option is passed in.

That takes out some of the guesswork on which option is the problem
when you get back something like -EINVAL. On something like CIFS (which
has a lot of crazy mount options) that is much more useful.

> If an error is reported at any step, an error message may be available to be
> read() back (ENODATA will be reported if there isn't an error available) in
> the form:
> 
> 	"e <subsys>:<problem>"
> 	"e SELinux:Mount on mountpoint not permitted"
> 

I like this too. No need to printk stuff when there is a problem
mounting. The error goes straight to the caller who initiated the
mount.

That's a very nice thing in heavily containerized environments, for
instance. If you're doing mounts in a container and they fail, then
trying to scrape dmesg and figure out which messages refer to _your_
mount attempt seems like it could be rather nasty. This solves that
problem in a much more sane way, IMO.
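
As a rough illustration of how a caller might consume such a message, here
is a userspace sketch that splits an "e <subsys>:<problem>" string into its
two parts.  The names parse_mount_error() and struct mount_error are made up
for this example, as are the buffer sizes; only the message format comes from
the patch description.  The length is passed explicitly because the read()
side in the patch does not copy a terminating NUL.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical holder for the two halves of "e <subsys>:<problem>". */
struct mount_error {
	char subsys[32];
	char problem[128];
};

/*
 * Split a message read back from the fd.  Returns 0 on success, -1 if the
 * text does not match the "e <subsys>:<problem>" form or does not fit.
 */
static int parse_mount_error(const char *msg, size_t len,
			     struct mount_error *out)
{
	const char *body, *colon;
	size_t slen, plen;

	if (len < 3 || msg[0] != 'e' || msg[1] != ' ')
		return -1;
	body = msg + 2;
	len -= 2;

	colon = memchr(body, ':', len);
	if (!colon)
		return -1;
	slen = colon - body;
	plen = len - slen - 1;
	if (slen == 0 || slen >= sizeof(out->subsys) ||
	    plen >= sizeof(out->problem))
		return -1;

	memcpy(out->subsys, body, slen);
	out->subsys[slen] = '\0';
	memcpy(out->problem, colon + 1, plen);
	out->problem[plen] = '\0';
	return 0;
}
```

A caller would pass the buffer and byte count returned by read() and could
then route the message by subsystem instead of scraping dmesg.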

> Once fsmount() has been called, further write() calls will incur EBUSY,
> even if the fsmount() fails.  read() is still possible to retrieve error
> information.
> 

What's the rationale for the above behavior?

A failed attempt to graft it into the tree doesn't seem like it would
have any real effect on the mount_context. While I can't think of a use
case for being able to try fsmount() again, I don't quite understand
why we'd prohibit someone from doing it.

> The fsopen() syscall creates a mount context and hangs it off the fd that it
> returns.
> 
> Netlink is not used because it is optional.
> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> ---
> 
>  arch/x86/entry/syscalls/syscall_32.tbl |    1 
>  arch/x86/entry/syscalls/syscall_64.tbl |    1 
>  fs/Makefile                            |    2 
>  fs/fsopen.c                            |  295 ++++++++++++++++++++++++++++++++
>  include/linux/syscalls.h               |    1 
>  include/uapi/linux/magic.h             |    1 
>  kernel/sys_ni.c                        |    3 
>  7 files changed, 303 insertions(+), 1 deletion(-)
>  create mode 100644 fs/fsopen.c
> 
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index 448ac2161112..9bf8d4c62f85 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -391,3 +391,4 @@
>  382	i386	pkey_free		sys_pkey_free
>  383	i386	statx			sys_statx
>  384	i386	arch_prctl		sys_arch_prctl			compat_sys_arch_prctl
> +385	i386	fsopen			sys_fsopen
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index 5aef183e2f85..9b198c5fc412 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -339,6 +339,7 @@
>  330	common	pkey_alloc		sys_pkey_alloc
>  331	common	pkey_free		sys_pkey_free
>  332	common	statx			sys_statx
> +333	common	fsopen			sys_fsopen
>  
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/fs/Makefile b/fs/Makefile
> index 308a104a9a07..b79024dbb37c 100644
> --- a/fs/Makefile
> +++ b/fs/Makefile
> @@ -12,7 +12,7 @@ obj-y :=	open.o read_write.o file_table.o super.o \
>  		seq_file.o xattr.o libfs.o fs-writeback.o \
>  		pnode.o splice.o sync.o utimes.o \
>  		stack.o fs_struct.o statfs.o fs_pin.o nsfs.o \
> -		mount_context.o
> +		mount_context.o fsopen.o
>  
>  ifeq ($(CONFIG_BLOCK),y)
>  obj-y +=	buffer.o block_dev.o direct-io.o mpage.o
> diff --git a/fs/fsopen.c b/fs/fsopen.c
> new file mode 100644
> index 000000000000..f02ea7d265db
> --- /dev/null
> +++ b/fs/fsopen.c
> @@ -0,0 +1,295 @@
> +/* fsopen.c: description
> + *
> + * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
> + * Written by David Howells (dhowells@redhat.com)
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public Licence
> + * as published by the Free Software Foundation; either version
> + * 2 of the Licence, or (at your option) any later version.
> + */
> +
> +#include <linux/mount.h>
> +#include <linux/slab.h>
> +#include <linux/uaccess.h>
> +#include <linux/file.h>
> +#include <linux/magic.h>
> +#include <linux/syscalls.h>
> +
> +static struct vfsmount *fs_fs_mnt __read_mostly;
> +static struct qstr empty_name = { .name = "" };
> +
> +static int fs_fs_release(struct inode *inode, struct file *file)
> +{
> +	struct mount_context *mc = file->private_data;
> +
> +	file->private_data = NULL;
> +
> +	put_mount_context(mc);
> +	return 0;
> +}
> +
> +/*
> + * Read any error message back from the fd.  Will be prefixed by "e ".
> + */
> +static ssize_t fs_fs_read(struct file *file, char __user *_buf, size_t len, loff_t *pos)
> +{
> +	struct mount_context *mc = file->private_data;
> +	const char *msg;
> +	size_t mlen;
> +
> +	msg = mc->error;
> +	if (!msg)
> +		return -ENODATA;
> +
> +	mlen = strlen(msg);
> +	if (mlen + 2 > len)
> +		return -ETOOSMALL;
> +	if (copy_to_user(_buf, "e ", 2) != 0 ||
> +	    copy_to_user(_buf + 2, msg, mlen) != 0)
> +		return -EFAULT;
> +	return mlen + 2;
> +}
> +
> +/*
> + * Userspace writes configuration data to the fd and we parse it here.  For the
> + * moment, we assume a single option per write.  Each line written is of the form
> + *
> + *	<option_type><space><stuff...>
> + *
> + *	d /dev/sda1				-- Device name
> + *	o noatime				-- Option without value
> + *	o cell=grand.central.org		-- Option with value
> + *	r /					-- Dir within device to mount
> + */
> +static ssize_t fs_fs_write(struct file *file,
> +			   const char __user *_buf, size_t len, loff_t *pos)
> +{
> +	struct mount_context *mc = file->private_data;
> +	struct inode *inode = file_inode(file);
> +	char opt[2], *data;
> +	ssize_t ret;
> +
> +	if (len < 3 || len > 4095)
> +		return -EINVAL;
> +
> +	if (copy_from_user(opt, _buf, 2) != 0)
> +		return -EFAULT;
> +	switch (opt[0]) {
> +	case 'd':
> +	case 'o':
> +	case 'r':
> +		break;
> +	default:
> +		return -EINVAL;
> +	}
> +	if (opt[1] != ' ')
> +		return -EINVAL;
> +
> +	data = kmalloc(len - 2 + 1, GFP_KERNEL);
> +	if (!data)
> +		return -ENOMEM;
> +
> +	ret = -EFAULT;
> +	if (copy_from_user(data, _buf + 2, len - 2) != 0)
> +		goto err_free;
> +	data[len - 2] = 0;
> +
> +	/* From this point onwards we need to lock the fd against someone
> +	 * trying to mount it.
> +	 */
> +	ret = inode_lock_killable(inode);
> +	if (ret < 0)
> +		goto err_free;
> +
> +	ret = -EBUSY;
> +	if (mc->mounted)
> +		goto err_unlock;
> +
> +	ret = -EINVAL;
> +	switch (opt[0]) {
> +	case 'd':
> +		if (mc->device)
> +			goto err_unlock;
> +		mc->device = data;
> +		data = NULL;
> +		break;
> +
> +	case 'o':
> +		ret = vfs_mount_option(mc, data);
> +		if (ret < 0)
> +			goto err_unlock;
> +		break;
> +
> +	case 'r':
> +		if (mc->root_path)
> +			goto err_unlock;
> +		mc->root_path = data;
> +		data = NULL;
> +		break;
> +
> +	default:
> +		goto err_unlock;
> +	}
> +
> +	ret = len;
> +err_unlock:
> +	inode_unlock(inode);
> +err_free:
> +	kfree(data);
> +	return ret;
> +}
> +
> +const struct file_operations fs_fs_fops = {
> +	.read		= fs_fs_read,
> +	.write		= fs_fs_write,
> +	.release	= fs_fs_release,
> +	.llseek		= no_llseek,
> +};
> +
> +/*
> + * Indicate the name we want to display the filesystem file as.
> + */
> +static char *fs_fs_dname(struct dentry *dentry, char *buffer, int buflen)
> +{
> +	return dynamic_dname(dentry, buffer, buflen, "fs:[%lu]",
> +			     d_inode(dentry)->i_ino);
> +}
> +
> +static const struct dentry_operations fs_fs_dentry_operations = {
> +	.d_dname	= fs_fs_dname,
> +};
> +
> +/*
> + * Create a file that can be used to configure a new mount.
> + */
> +static struct file *create_fs_file(struct mount_context *mc)
> +{
> +	struct inode *inode;
> +	struct file *f;
> +	struct path path;
> +	int ret;
> +
> +	inode = alloc_anon_inode(fs_fs_mnt->mnt_sb);
> +	if (!inode)
> +		return ERR_PTR(-ENFILE);
> +	inode->i_fop = &fs_fs_fops;
> +
> +	ret = -ENOMEM;
> +	path.dentry = d_alloc_pseudo(fs_fs_mnt->mnt_sb, &empty_name);
> +	if (!path.dentry)
> +		goto err_inode;
> +	path.mnt = mntget(fs_fs_mnt);
> +
> +	d_instantiate(path.dentry, inode);
> +
> +	f = alloc_file(&path, FMODE_READ | FMODE_WRITE, &fs_fs_fops);
> +	if (IS_ERR(f)) {
> +		ret = PTR_ERR(f);
> +		goto err_file;
> +	}
> +
> +	f->private_data = mc;
> +	return f;
> +
> +err_file:
> +	path_put(&path);
> +	return ERR_PTR(ret);
> +
> +err_inode:
> +	iput(inode);
> +	return ERR_PTR(ret);
> +}
> +
> +static const struct super_operations fs_fs_ops = {
> +	.drop_inode	= generic_delete_inode,
> +	.destroy_inode	= free_inode_nonrcu,
> +	.statfs		= simple_statfs,
> +};
> +
> +static struct dentry *fs_fs_mount(struct file_system_type *fs_type,
> +				  int flags, const char *dev_name,
> +				  void *data)
> +{
> +	return mount_pseudo(fs_type, "fs_fs:", &fs_fs_ops,
> +			    &fs_fs_dentry_operations, FS_FS_MAGIC);
> +}
> +
> +static struct file_system_type fs_fs_type = {
> +	.name		= "fs_fs",
> +	.mount		= fs_fs_mount,
> +	.kill_sb	= kill_anon_super,
> +};
> +
> +static int __init init_fs_fs(void)
> +{
> +	int ret;
> +
> +	ret = register_filesystem(&fs_fs_type);
> +	if (ret < 0)
> +		panic("Cannot register fs_fs\n");
> +
> +	fs_fs_mnt = kern_mount(&fs_fs_type);
> +	if (IS_ERR(fs_fs_mnt))
> +		panic("Cannot mount fs_fs: %ld\n", PTR_ERR(fs_fs_mnt));
> +	return 0;
> +}
> +
> +fs_initcall(init_fs_fs);
> +
> +/*
> + * Open a filesystem by name so that it can be configured for mounting.
> + *
> + * We are allowed to specify a container in which the filesystem will be
> + * opened, thereby indicating which namespaces will be used (notably, which
> + * network namespace will be used for network filesystems).
> + */
> +SYSCALL_DEFINE3(fsopen, const char __user *, _fs_name, int, reserved,
> +		unsigned int, flags)
> +{
> +	struct mount_context *mc;
> +	struct file *file;
> +	const char *fs_name;
> +	int fd, ret;
> +
> +	if (flags & ~O_CLOEXEC || reserved != -1)
> +		return -EINVAL;
> +
> +	fs_name = strndup_user(_fs_name, PAGE_SIZE);
> +	if (IS_ERR(fs_name))
> +		return PTR_ERR(fs_name);
> +
> +	mc = vfs_fsopen(fs_name);
> +	if (IS_ERR(mc)) {
> +		ret = PTR_ERR(mc);
> +		goto err_fs_name;
> +	}
> +
> +	ret = -ENOTSUPP;
> +	if (!mc->ops)
> +		goto err_mc;
> +
> +	file = create_fs_file(mc);
> +	if (IS_ERR(file)) {
> +		ret = PTR_ERR(file);
> +		goto err_mc;
> +	}
> +
> +	ret = get_unused_fd_flags(flags & O_CLOEXEC);
> +	if (ret < 0)
> +		goto err_file;
> +
> +	fd = ret;
> +	fd_install(fd, file);
> +	return fd;
> +
> +err_file:
> +	fput(file);
> +	return ret;
> +
> +err_mc:
> +	put_mount_context(mc);
> +err_fs_name:
> +	kfree(fs_name);
> +	return ret;
> +}
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 980c3c9b06f8..91ec8802ad5d 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -905,5 +905,6 @@ asmlinkage long sys_pkey_alloc(unsigned long flags, unsigned long init_val);
>  asmlinkage long sys_pkey_free(int pkey);
>  asmlinkage long sys_statx(int dfd, const char __user *path, unsigned flags,
>  			  unsigned mask, struct statx __user *buffer);
> +asmlinkage long sys_fsopen(const char *fs_name, int containerfd, unsigned int flags);
>  
>  #endif
> diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> index e230af2e6855..88ae83492f7c 100644
> --- a/include/uapi/linux/magic.h
> +++ b/include/uapi/linux/magic.h
> @@ -84,5 +84,6 @@
>  #define UDF_SUPER_MAGIC		0x15013346
>  #define BALLOON_KVM_MAGIC	0x13661366
>  #define ZSMALLOC_MAGIC		0x58295829
> +#define FS_FS_MAGIC		0x66736673
>  
>  #endif /* __LINUX_MAGIC_H__ */
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index 8acef8576ce9..de1dc63e7e47 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -258,3 +258,6 @@ cond_syscall(sys_membarrier);
>  cond_syscall(sys_pkey_mprotect);
>  cond_syscall(sys_pkey_alloc);
>  cond_syscall(sys_pkey_free);
> +
> +/* fd-based mount */
> +cond_syscall(sys_fsopen);
> 

-- 
Jeff Layton <jlayton@poochiereds.net>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 3/9] VFS: Introduce a mount context
  2017-05-03 18:13   ` Jeff Layton
  2017-05-03 18:26     ` Joe Perches
@ 2017-05-03 18:37     ` David Howells
  2017-05-03 18:43       ` Joe Perches
  2017-05-03 20:11       ` David Howells
  2017-05-04  9:27     ` David Howells
  2 siblings, 2 replies; 66+ messages in thread
From: David Howells @ 2017-05-03 18:37 UTC (permalink / raw)
  To: Joe Perches
  Cc: dhowells, Jeff Layton, viro, linux-fsdevel, linux-nfs,
	linux-kernel, mszeredi

Joe Perches <joe@perches.com> wrote:

> krealloc would probably be more efficient and possibly more
> readable, as there is likely already padding in the original
> allocation.

The problem is if krealloc() fails: you've lost all those pointers to things
you then need to free.

> Are there no locking constraints?

Generally, no, not until you do the ->mount() op.  Also remounting needs a
lock, but that's already done with the sb->s_umount lock.

However, that said, if you do:

	fd = fsopen("foofs");
	write(fd, "o foo=bar", ...);
	fsmount(fd, "/foo");

then the fsmount() and write() calls have to lock against other fsmount() and
write() calls.  I use the inode lock for this.  [Note that it probably should
be interruptible rather than just killable, but there's no primitive for that
as yet].

David

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 4/9] Implement fsopen() to prepare for a mount
  2017-05-03 16:05 ` [PATCH 4/9] Implement fsopen() to prepare for a mount David Howells
  2017-05-03 18:37   ` Jeff Layton
@ 2017-05-03 18:41   ` David Howells
  2017-05-03 20:44   ` Rasmus Villemoes
                     ` (6 subsequent siblings)
  8 siblings, 0 replies; 66+ messages in thread
From: David Howells @ 2017-05-03 18:41 UTC (permalink / raw)
  To: Jeff Layton
  Cc: dhowells, viro, linux-fsdevel, linux-nfs, linux-kernel, mszeredi

Jeff Layton <jlayton@poochiereds.net> wrote:

> I think one of the neat things here is that we can now error out as
> soon as a bogus mount option is passed in.

It means that the 'sloppy' option can also now be implemented in userspace.
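
A minimal sketch of what a userspace "sloppy" mode could look like, assuming
each option is submitted on its own and a rejected option comes back as
-EINVAL.  The set_option_fn callback is a stand-in for the per-option
write() on the fsopen fd; apply_options() and toy_set_option() are
hypothetical names used only for illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for write(mfd, "o <opt>", ...): returns 0 or a negative errno. */
typedef int (*set_option_fn)(const char *opt);

/*
 * Feed options one at a time.  With sloppy set, an option rejected with
 * -EINVAL is skipped and counted; any other error aborts.
 * Returns the number of skipped options, or a negative errno.
 */
static int apply_options(set_option_fn set_option, const char *const *opts,
			 size_t n, int sloppy)
{
	int skipped = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		int ret = set_option(opts[i]);

		if (ret == 0)
			continue;
		if (ret == -EINVAL && sloppy) {
			skipped++;
			continue;
		}
		return ret;
	}
	return skipped;
}

/* Toy filesystem for the example: it only understands "noatime" and "acl". */
static int toy_set_option(const char *opt)
{
	if (strcmp(opt, "noatime") == 0 || strcmp(opt, "acl") == 0)
		return 0;
	return -EINVAL;
}
```

With the options "noatime", "bogus" and "acl", the sloppy call skips one
option and carries on, while the strict call stops at "bogus" with -EINVAL.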

> > Once fsmount() has been called, further write() calls will incur EBUSY,
> > even if the fsmount() fails.  read() is still possible to retrieve error
> > information.
> > 
> 
> What's the rationale for the above behavior?
> 
> A failed attempt to graft it into the tree doesn't seem like it would
> have any real effect on the mount_context. While I can't think of a use
> case for being able to try fsmount() again, I don't quite understand
> why we'd prohibit someone from doing it.

The mount procedure is allowed to preallocate resources and attach them to the
mount context and ->mount() is allowed to use them up, say by transferring
them to the superblock.  The mount context is then in a degraded state and
cannot necessarily be reused.

David

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 3/9] VFS: Introduce a mount context
  2017-05-03 18:37     ` David Howells
@ 2017-05-03 18:43       ` Joe Perches
  2017-05-03 20:11       ` David Howells
  1 sibling, 0 replies; 66+ messages in thread
From: Joe Perches @ 2017-05-03 18:43 UTC (permalink / raw)
  To: David Howells
  Cc: Jeff Layton, viro, linux-fsdevel, linux-nfs, linux-kernel, mszeredi

On Wed, 2017-05-03 at 19:37 +0100, David Howells wrote:
> Joe Perches <joe@perches.com> wrote:
> 
> > krealloc would probably be more efficient and possibly more
> > readable, as there is likely already padding in the original
> > allocation.
> 
> The problem is if krealloc() fails: you've lost all those pointers to things
> you then need to free.

Huh?  How could that happen?

krealloc must always use a temporary.
If krealloc returns NULL, the original allocation is kept.
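
The same pattern in userspace terms, as a sketch only: realloc() stands in
for krealloc() here, and append_ptr() is a hypothetical helper, not
something from the patch.  The result goes into a temporary, so a failed
allocation leaves the caller's array and everything it points at intact.

```c
#include <stdlib.h>

/*
 * Append str to a heap-allocated array of *count string pointers.
 * realloc() (like krealloc()) leaves the old block untouched when it
 * fails, so the result goes into a temporary first and *arr is only
 * updated on success.  Returns 0 on success, -1 on allocation failure.
 */
static int append_ptr(char ***arr, size_t *count, char *str)
{
	char **tmp;

	tmp = realloc(*arr, (*count + 1) * sizeof(*tmp));
	if (!tmp)
		return -1;	/* *arr and its entries are still valid */

	tmp[*count] = str;
	*arr = tmp;
	(*count)++;
	return 0;
}
```

Since realloc(NULL, n) behaves like malloc(n), the first append needs no
special case for an empty array.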

> > Are there no locking constraints?
> 
> Generally, no, not until you do the ->mount() op.  Also remounting needs a
> lock, but that's already done with the sb->s_umount lock.
> 
> However, that said, if you do:
> 
> 	fd = fsopen("foofs");
> 	write(fd, "o foo=bar", ...);
> 	fsmount(fd, "/foo");
> 
> then the fsmount() and write() calls have to lock against other fsmount() and
> write() calls.  I use the inode lock for this.  [Note that it probably should
> be interruptible rather than just killable, but there's no primitive for that
> as yet].
> 
> David

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 1/9] Provide a function to create a NUL-terminated string from unterminated data
  2017-05-03 16:04 ` [PATCH 1/9] Provide a function to create a NUL-terminated string from unterminated data David Howells
  2017-05-03 16:55   ` Jeff Layton
@ 2017-05-03 19:26   ` Rasmus Villemoes
  2017-05-03 20:13   ` David Howells
  2 siblings, 0 replies; 66+ messages in thread
From: Rasmus Villemoes @ 2017-05-03 19:26 UTC (permalink / raw)
  To: David Howells; +Cc: viro, linux-fsdevel, linux-nfs, linux-kernel, mszeredi

On Wed, May 03 2017, David Howells <dhowells@redhat.com> wrote:

> Provide a function, kstrcreate()

<bikeshed>why not kmemdup_nul, since it seems to be to kmemdup exactly as
memdup_user_nul is to memdup_user?</bikeshed> 
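
For reference, the operation being named is roughly the following, shown
here as a userspace sketch with malloc() in place of the kernel allocator.
The name memdup_nul() is chosen for this example and is not the function
from the patch.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Userspace analogue: copy len bytes that may lack a terminator into a
 * fresh buffer and append a NUL.  Returns NULL on allocation failure or
 * if len + 1 would wrap.
 */
static char *memdup_nul(const void *src, size_t len)
{
	char *p;

	if (len == SIZE_MAX)
		return NULL;
	p = malloc(len + 1);
	if (!p)
		return NULL;
	memcpy(p, src, len);
	p[len] = '\0';
	return p;
}
```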

Rasmus

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 3/9] VFS: Introduce a mount context
  2017-05-03 18:37     ` David Howells
  2017-05-03 18:43       ` Joe Perches
@ 2017-05-03 20:11       ` David Howells
  1 sibling, 0 replies; 66+ messages in thread
From: David Howells @ 2017-05-03 20:11 UTC (permalink / raw)
  To: Joe Perches
  Cc: dhowells, Jeff Layton, viro, linux-fsdevel, linux-nfs,
	linux-kernel, mszeredi

Joe Perches <joe@perches.com> wrote:

> > > krealloc would probably be more efficient and possibly more
> > > readable, as there is likely already padding in the original
> > > allocation.
> > 
> > The problem is if krealloc() fails: you've lost all those pointers to things
> > you then need to free.
> 
> Huh?  How could that happen?
> 
> krealloc must always use a temporary.
> If krealloc returns NULL, the original allocation is kept.

Hmmm...  Good point.

David

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 1/9] Provide a function to create a NUL-terminated string from unterminated data
  2017-05-03 16:04 ` [PATCH 1/9] Provide a function to create a NUL-terminated string from unterminated data David Howells
  2017-05-03 16:55   ` Jeff Layton
  2017-05-03 19:26   ` Rasmus Villemoes
@ 2017-05-03 20:13   ` David Howells
  2 siblings, 0 replies; 66+ messages in thread
From: David Howells @ 2017-05-03 20:13 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: dhowells, viro, linux-fsdevel, linux-nfs, linux-kernel, mszeredi

Rasmus Villemoes <linux@rasmusvillemoes.dk> wrote:

> > Provide a function, kstrcreate()
> 
> <bikeshed>why not kmemdup_nul, since it seems to be to kmemdup exactly as
> memdup_user_nul is to memdup_user?</bikeshed> 

Yeah, that would work too.  Or kmem2str().

David

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 3/9] VFS: Introduce a mount context
  2017-05-03 18:26     ` Joe Perches
@ 2017-05-03 20:38       ` Matthew Wilcox
  2017-05-03 21:36         ` Joe Perches
  2017-05-03 21:17       ` David Howells
  1 sibling, 1 reply; 66+ messages in thread
From: Matthew Wilcox @ 2017-05-03 20:38 UTC (permalink / raw)
  To: Joe Perches
  Cc: Jeff Layton, David Howells, viro, linux-fsdevel, linux-nfs,
	linux-kernel, mszeredi

On Wed, May 03, 2017 at 11:26:38AM -0700, Joe Perches wrote:
> On Wed, 2017-05-03 at 14:13 -0400, Jeff Layton wrote:
> > On Wed, 2017-05-03 at 17:04 +0100, David Howells wrote:
> > > +		oo = kmalloc((opts->num_mnt_opts + 1) * sizeof(char *),
> > > +			     GFP_KERNEL);

If we're picking nits, then this should be kcalloc in case somebody
passed in 2^31 in num_mnt_opts.
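
To make the concern concrete, here is a userspace sketch of the check that
an overflow-aware array allocator performs before multiplying.
alloc_ptr_array() is a hypothetical helper; calloc() stands in for
kcalloc(), and the explicit test is redundant with what calloc() already
does internally.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Allocate room for n + 1 pointers, refusing sizes whose byte count
 * would wrap around.  calloc() performs a similar check internally;
 * it is spelled out here to show what the check guards against.
 */
static char **alloc_ptr_array(size_t n)
{
	if (n >= SIZE_MAX / sizeof(char *))
		return NULL;	/* (n + 1) * sizeof(char *) would overflow */
	return calloc(n + 1, sizeof(char *));
}
```

A bare malloc((n + 1) * sizeof(char *)) has no such guard: a wrapped product
yields a short buffer that a later memcpy() then overruns.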

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 4/9] Implement fsopen() to prepare for a mount
  2017-05-03 16:05 ` [PATCH 4/9] Implement fsopen() to prepare for a mount David Howells
  2017-05-03 18:37   ` Jeff Layton
  2017-05-03 18:41   ` David Howells
@ 2017-05-03 20:44   ` Rasmus Villemoes
  2017-05-04 10:40   ` Karel Zak
                     ` (5 subsequent siblings)
  8 siblings, 0 replies; 66+ messages in thread
From: Rasmus Villemoes @ 2017-05-03 20:44 UTC (permalink / raw)
  To: David Howells; +Cc: viro, linux-fsdevel, linux-nfs, linux-kernel, mszeredi

On Wed, May 03 2017, David Howells <dhowells@redhat.com> wrote:

> --- /dev/null
> +++ b/fs/fsopen.c
> @@ -0,0 +1,295 @@
> +/* fsopen.c: description
> + *

leftover from some template?

> + * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
> + * Written by David Howells (dhowells@redhat.com)
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public Licence
> + * as published by the Free Software Foundation; either version
> + * 2 of the Licence, or (at your option) any later version.
> + */
> +
> +#include <linux/mount.h>
> +#include <linux/slab.h>
> +#include <linux/uaccess.h>
> +#include <linux/file.h>
> +#include <linux/magic.h>
> +#include <linux/syscalls.h>
> +
> +static struct vfsmount *fs_fs_mnt __read_mostly;
> +static struct qstr empty_name = { .name = "" };
> +
> +static int fs_fs_release(struct inode *inode, struct file *file)
> +{
> +	struct mount_context *mc = file->private_data;
> +
> +	file->private_data = NULL;
> +
> +	put_mount_context(mc);
> +	return 0;
> +}
> +
> +/*
> + * Read any error message back from the fd.  Will be prefixed by "e ".
> + */
> +static ssize_t fs_fs_read(struct file *file, char __user *_buf, size_t len, loff_t *pos)
> +{
> +	struct mount_context *mc = file->private_data;
> +	const char *msg;
> +	size_t mlen;
> +
> +	msg = mc->error;
> +	if (!msg)
> +		return -ENODATA;
> +
> +	mlen = strlen(msg);
> +	if (mlen + 2 > len)
> +		return -ETOOSMALL;
> +
> +	if (copy_to_user(_buf, "e ", 2) != 0 ||
> +	    copy_to_user(_buf + 2, msg, mlen) != 0)
> +		return -EFAULT;
> +	return mlen + 2;
> +}

OK, mc->error must be static data, so no lifetime problems. But is it
possible for the compiler to mess this up and reload msg from mc->error
when it's about to do the user copy, so that if some other thread has
managed to change mc->error (or is the error state sticky and no further
operations allowed?) we'd copy from a string with a different length?

> +/*
> + * Userspace writes configuration data to the fd and we parse it here.  For the
> + * moment, we assume a single option per write.  Each line written is of the form
> + *
> + *	<option_type><space><stuff...>
> + *
> + *	d /dev/sda1				-- Device name
> + *	o noatime				-- Option without value
> + *	o cell=grand.central.org		-- Option with value
> + *	r /					-- Dir within device to mount
> + */
> +static ssize_t fs_fs_write(struct file *file,
> +			   const char __user *_buf, size_t len, loff_t *pos)
> +{
> +	struct mount_context *mc = file->private_data;
> +	struct inode *inode = file_inode(file);
> +	char opt[2], *data;
> +	ssize_t ret;
> +
> +	if (len < 3 || len > 4095)
> +		return -EINVAL;
> +
> +	if (copy_from_user(opt, _buf, 2) != 0)
> +		return -EFAULT;
> +	switch (opt[0]) {
> +	case 'd':
> +	case 'o':
> +	case 'r':
> +		break;
> +	default:
> +		return -EINVAL;
> +	}
> +	if (opt[1] != ' ')
> +		return -EINVAL;
> +
> +	data = kmalloc(len - 2 + 1, GFP_KERNEL);
> +	if (!data)
> +		return -ENOMEM;
> +
> +	ret = -EFAULT;
> +	if (copy_from_user(data, _buf + 2, len - 2) != 0)
> +		goto err_free;
> +	data[len - 2] = 0;
> +

This hunk seems to be equivalent to

data = memdup_user_nul(_buf + 2, len - 2);
if (IS_ERR(data))
  return PTR_ERR(data);

plus the err_free: label gets killed...

> 
> +	/* From this point onwards we need to lock the fd against someone
> +	 * trying to mount it.
> +	 */
> +	ret = inode_lock_killable(inode);
> +	if (ret < 0)
> +		return ret;

...except that we need to jump to it here to avoid leaking data.

> +	ret = -EBUSY;
> +	if (mc->mounted)
> +		goto err_unlock;
> +
> +	ret = -EINVAL;
> +	switch (opt[0]) {
> +	case 'd':
> +		if (mc->device)
> +			goto err_unlock;
> +		mc->device = data;
> +		data = NULL;
> +		break;
> +
> +	case 'o':
> +		ret = vfs_mount_option(mc, data);
> +		if (ret < 0)
> +			goto err_unlock;
> +		break;
> +
> +	case 'r':
> +		if (mc->root_path)
> +			goto err_unlock;
> +		mc->root_path = data;
> +		data = NULL;
> +		break;
> +
> +	default:
> +		goto err_unlock;
> +	}
> +
> +	ret = len;
> +err_unlock:
> +	inode_unlock(inode);
> +err_free:
> +	kfree(data);
> +	return ret;
> +}
> +
> +const struct file_operations fs_fs_fops = {
> +	.read		= fs_fs_read,
> +	.write		= fs_fs_write,
> +	.release	= fs_fs_release,
> +	.llseek		= no_llseek,
> +};
> +

static const struct ?

> +/*
> + * Open a filesystem by name so that it can be configured for mounting.
> + *
> + * We are allowed to specify a container in which the filesystem will be
> + * opened, thereby indicating which namespaces will be used (notably, which
> + * network namespace will be used for network filesystems).
> + */
> +SYSCALL_DEFINE3(fsopen, const char __user *, _fs_name, int, reserved,
> +		unsigned int, flags)
> +{
> +	struct mount_context *mc;
> +	struct file *file;
> +	const char *fs_name;
> +	int fd, ret;
> +
> +	if (flags & ~O_CLOEXEC || reserved != -1)
> +		return -EINVAL;
> +
> +	fs_name = strndup_user(_fs_name, PAGE_SIZE);
> +	if (IS_ERR(fs_name))
> +		return PTR_ERR(fs_name);
> +
> +	mc = vfs_fsopen(fs_name);
> +	if (IS_ERR(mc)) {
> +		ret = PTR_ERR(mc);
> +		goto err_fs_name;
> +	}
> +

Where does fs_name now get freed? vfs_fsopen doesn't seem to do it on
success? (If it did, the fallthrough from err_mc: to err_fs_name: would
be wrong.)

> +	ret = -ENOTSUPP;
> +	if (!mc->ops)
> +		goto err_mc;
> +
> +	file = create_fs_file(mc);
> +	if (IS_ERR(file)) {
> +		ret = PTR_ERR(file);
> +		goto err_mc;
> +	}
> +
> +	ret = get_unused_fd_flags(flags & O_CLOEXEC);
> +	if (ret < 0)
> +		goto err_file;
> +
> +	fd = ret;
> +	fd_install(fd, file);
> +	return fd;
> +
> +err_file:
> +	fput(file);
> +	return ret;
> +
> +err_mc:
> +	put_mount_context(mc);
> +err_fs_name:
> +	kfree(fs_name);
> +	return ret;
> +}

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 3/9] VFS: Introduce a mount context
  2017-05-03 18:26     ` Joe Perches
  2017-05-03 20:38       ` Matthew Wilcox
@ 2017-05-03 21:17       ` David Howells
  1 sibling, 0 replies; 66+ messages in thread
From: David Howells @ 2017-05-03 21:17 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: dhowells, Joe Perches, Jeff Layton, viro, linux-fsdevel,
	linux-nfs, linux-kernel, mszeredi

Matthew Wilcox <willy@infradead.org> wrote:

> On Wed, May 03, 2017 at 11:26:38AM -0700, Joe Perches wrote:
> > On Wed, 2017-05-03 at 14:13 -0400, Jeff Layton wrote:
> > > On Wed, 2017-05-03 at 17:04 +0100, David Howells wrote:
> > > > +		oo = kmalloc((opts->num_mnt_opts + 1) * sizeof(char *),
> > > > +			     GFP_KERNEL);
> 
> If we're picking nits, then this should be kcalloc in case somebody
> passed in 2^31 in num_mnt_opts.

A few lines previously there is:

	if (opts->num_mnt_opts > 3) {
		mc->error = "SELinux: Too many options";
		return -EINVAL;
	}

David

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 3/9] VFS: Introduce a mount context
  2017-05-03 20:38       ` Matthew Wilcox
@ 2017-05-03 21:36         ` Joe Perches
  2017-05-04  6:28           ` Julia Lawall
  0 siblings, 1 reply; 66+ messages in thread
From: Joe Perches @ 2017-05-03 21:36 UTC (permalink / raw)
  To: Matthew Wilcox, Julia Lawall, cocci
  Cc: Jeff Layton, David Howells, viro, linux-fsdevel, linux-nfs,
	linux-kernel, mszeredi

(adding Julia Lawall and cocci)

On Wed, 2017-05-03 at 13:38 -0700, Matthew Wilcox wrote:
> On Wed, May 03, 2017 at 11:26:38AM -0700, Joe Perches wrote:
> > On Wed, 2017-05-03 at 14:13 -0400, Jeff Layton wrote:
> > > On Wed, 2017-05-03 at 17:04 +0100, David Howells wrote:
> > > > +		oo = kmalloc((opts->num_mnt_opts + 1) * sizeof(char *),
> > > > +			     GFP_KERNEL);
> 
> If we're picking nits, then this should be kcalloc in case somebody
> passed in 2^31 in num_mnt_opts.

There are likely dozens to hundreds of possible/silent
multiplication overflow defects in the kernel, not just
in allocations.

Auditing the sources would seem labor intensive.

Perhaps coccinelle could help find them.

Perhaps there should be some overflow checking functions
added to math64.h

Maybe some form like:

u32 u32_mul_u32_u32(u32 a, u32 b)
{
	u32 res = a * b;

	WARN_ON(a != 0 && res / a != b);

	return res;
}
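
A variant of the same division-based check that reports the overflow to
the caller instead of warning might look like the sketch below.  It is
plain userspace C, the name u32_mul_checked() is made up for illustration,
and it assumes a platform where int is no wider than 32 bits.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true and stores a * b in *res if the
 * product fits in 32 bits; returns false and leaves *res alone otherwise.
 */
static bool u32_mul_checked(uint32_t a, uint32_t b, uint32_t *res)
{
	uint32_t r = a * b;	/* wraps modulo 2^32 on overflow */

	if (a != 0 && r / a != b)
		return false;
	*res = r;
	return true;
}
```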

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 3/9] VFS: Introduce a mount context
  2017-05-03 16:04 ` [PATCH 3/9] VFS: Introduce a mount context David Howells
  2017-05-03 18:13   ` Jeff Layton
@ 2017-05-03 21:43   ` Rasmus Villemoes
  2017-05-04 10:22   ` David Howells
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 66+ messages in thread
From: Rasmus Villemoes @ 2017-05-03 21:43 UTC (permalink / raw)
  To: David Howells; +Cc: viro, linux-fsdevel, linux-nfs, linux-kernel, mszeredi

On Wed, May 03 2017, David Howells <dhowells@redhat.com> wrote:

> fs_type->fsopen() is called to set it up.  fs_type->mc_size says how much
> should be added on to the mount context for the filesystem's use.

This is repeated several times in the documentation, but the code says
that ->mc_size should be the full size of the struct wrapping struct
mount_context.

> diff --git a/fs/mount.h b/fs/mount.h
> index 2826543a131d..b1e99b38f2ee 100644
> --- a/fs/mount.h
> +++ b/fs/mount.h
> @@ -108,9 +108,10 @@ static inline void detach_mounts(struct dentry *dentry)
>  	__detach_mounts(dentry);
>  }
>  
> -static inline void get_mnt_ns(struct mnt_namespace *ns)
> +static inline struct mnt_namespace *get_mnt_ns(struct mnt_namespace *ns)
>  {
>  	atomic_inc(&ns->count);
> +	return ns;
>  }
>
>  extern seqlock_t mount_lock;

it's not much, but at least this could go into a patch of its own.

> +/**
> + * __vfs_fsopen - Open a filesystem and create a mount context
> + * @fs_type: The filesystem type
> + * @src_sb: A superblock from which this one derives (or NULL)
> + * @ms_flags: Superblock flags and op flags (such as MS_REMOUNT)
> + * @mnt_flags: Mountpoint flags, such as MNT_READONLY
> + * @mount_type: Type of mount
> + *
> + * Open a filesystem and create a mount context.  The mount context is
> + * initialised with the supplied flags and, if a submount/automount from
> + * another superblock (@src_sb), may have parameters such as namespaces copied
> + * across from that superblock.
> + */
> +struct mount_context *__vfs_fsopen(struct file_system_type *fs_type,
> +				   struct super_block *src_sb,
> +				   unsigned int ms_flags, unsigned int mnt_flags,
> +				   enum mount_type mount_type)
> +{
> +	struct mount_context *mc;
> +	int ret;
> +
> +	if (fs_type->fsopen && fs_type->mc_size < sizeof(*mc))
> +		BUG();

So ->mc_size can be 0 (i.e. not explicitly initialized) if fs_type does
not have ->fsopen. OK.

> +	mc = kzalloc(max_t(size_t, fs_type->mc_size, sizeof(*mc)), GFP_KERNEL);

In which case we round up to sizeof(*mc). OK.

> +	if (!mc)
> +		return ERR_PTR(-ENOMEM);
> +
> +	mc->mount_type = mount_type;
> +	mc->ms_flags = ms_flags;
> +	mc->mnt_flags = mnt_flags;
> +	mc->fs_type = fs_type;
> +	get_filesystem(fs_type);

Maybe get_filesystem should also be taught to return its argument so
this could be written like the below assignments.

> +	mc->mnt_ns = get_mnt_ns(current->nsproxy->mnt_ns);
> +	mc->pid_ns = get_pid_ns(task_active_pid_ns(current));
> +	mc->net_ns = get_net(current->nsproxy->net_ns);
> +	mc->user_ns = get_user_ns(current_user_ns());
> +	mc->cred = get_current_cred();
> +
> +
> +/**
> + * vfs_dup_mount_context: Duplicate a mount context.
> + * @src: The mount context to copy.
> + */
> +struct mount_context *vfs_dup_mount_context(struct mount_context *src)
> +{
> +	struct mount_context *mc;
> +	int ret;
> +
> +	if (!src->ops->dup)
> +		return ERR_PTR(-ENOTSUPP);
> +
> +	mc = kmemdup(src, src->fs_type->mc_size, GFP_KERNEL);

So this assumes that vfs_dup_mount_context is only used if ->mc_size is
explicitly initialized. A max_t here as well probably wouldn't hurt.

> +	unsigned short mc_size;		/* Size of mount context to allocate */

Any particular reason to use a short? The struct doesn't pack any better.

> +static int selinux_mount_ctx_dup(struct mount_context *mc,
> +				 struct mount_context *src_mc)
> +{
> +	const struct security_mnt_opts *src = src_mc->security;
> +	struct security_mnt_opts *opts;
> +	int i, n;
> +
> +	opts = kzalloc(sizeof(*opts), GFP_KERNEL);
> +	if (!opts)
> +		return -ENOMEM;
> +	mc->security = opts;
> +
> +	if (!src || !src->num_mnt_opts)
> +		return 0;
> +	n = opts->num_mnt_opts = src->num_mnt_opts;
> +
> +	if (opts->mnt_opts) {

should probably be src->mnt_opts

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 3/9] VFS: Introduce a mount context
  2017-05-03 21:36         ` Joe Perches
@ 2017-05-04  6:28           ` Julia Lawall
  0 siblings, 0 replies; 66+ messages in thread
From: Julia Lawall @ 2017-05-04  6:28 UTC (permalink / raw)
  To: Joe Perches
  Cc: Matthew Wilcox, cocci, Jeff Layton, David Howells, viro,
	linux-fsdevel, linux-nfs, linux-kernel, mszeredi


On Wed, 3 May 2017, Joe Perches wrote:

> (adding Julia Lawall and cocci)
>
> On Wed, 2017-05-03 at 13:38 -0700, Matthew Wilcox wrote:
> > On Wed, May 03, 2017 at 11:26:38AM -0700, Joe Perches wrote:
> > > On Wed, 2017-05-03 at 14:13 -0400, Jeff Layton wrote:
> > > > On Wed, 2017-05-03 at 17:04 +0100, David Howells wrote:
> > > > > +		oo = kmalloc((opts->num_mnt_opts + 1) * sizeof(char *),
> > > > > +			     GFP_KERNEL);
> >
> > If we're picking nits, then this should be kcalloc in case somebody
> > passed in 2^31 in num_mnt_opts.
>
> There are likely dozens to hundreds of possible/silent
> multiplication overflow defects in the kernel, not just
> in allocations.
>
> Auditing the sources would seem labor intensive.
>
> Perhaps coccinelle could help find them.
>
> Perhaps there should be some overflow checking functions
> added to math64.h
>
> Maybe some form like:
>
> u32 u32_mul_u32_u32(u32 a, u32 b)
> {
> 	u32 res = a * b;
>
> 	WARN_ON(a != 0 && res / a != b);
>
> 	return res;
> }

Coccinelle doesn't know about the values of variables.  It would need some
heuristics about where potentially large values can come from.

julia

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 3/9] VFS: Introduce a mount context
  2017-05-03 18:13   ` Jeff Layton
  2017-05-03 18:26     ` Joe Perches
  2017-05-03 18:37     ` David Howells
@ 2017-05-04  9:27     ` David Howells
  2017-05-04 14:34       ` Joe Perches
  2 siblings, 1 reply; 66+ messages in thread
From: David Howells @ 2017-05-04  9:27 UTC (permalink / raw)
  To: Joe Perches
  Cc: dhowells, Jeff Layton, viro, linux-fsdevel, linux-nfs,
	linux-kernel, mszeredi

Joe Perches <joe@perches.com> wrote:

> krealloc would probably be more efficient and possibly more
> readable, as there's likely already padding in the original
> allocation.

Given there's a maximum of 3 slots, I think it makes better sense to just
allocate them all up front.

David

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 3/9] VFS: Introduce a mount context
  2017-05-03 16:04 ` [PATCH 3/9] VFS: Introduce a mount context David Howells
  2017-05-03 18:13   ` Jeff Layton
  2017-05-03 21:43   ` Rasmus Villemoes
@ 2017-05-04 10:22   ` David Howells
  2017-05-08 15:05   ` Miklos Szeredi
  2017-05-08 22:57   ` David Howells
  4 siblings, 0 replies; 66+ messages in thread
From: David Howells @ 2017-05-04 10:22 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: dhowells, viro, linux-fsdevel, linux-nfs, linux-kernel, mszeredi

Rasmus Villemoes <linux@rasmusvillemoes.dk> wrote:

> > +	if (fs_type->fsopen && fs_type->mc_size < sizeof(*mc))
> > +		BUG();
> 
> So ->mc_size can be 0 (i.e. not explicitly initialized) if fs_type does
> not have ->fsopen. OK.

I need to be able to handle filesystems that don't support this yet.  Once all
filesystems support this, I would be able to take away the max_t() thing.

> > +	if (!src->ops->dup)
> > +		return ERR_PTR(-ENOTSUPP);
> > +
> > +	mc = kmemdup(src, src->fs_type->mc_size, GFP_KERNEL);
> 
> So this assumes that vfs_dup_mount_context is only used if ->mc_size is
> explicitly initialized. A max_t here as well probably wouldn't hurt.

If you don't provide an ->fsopen() op, you can't set src->ops, you don't see a
mount context and you can't call this function.  If you did supply an
->fsopen() op, the BUG() would've got you if you didn't set ->mc_size.

> > +	unsigned short mc_size;		/* Size of mount context to allocate */
> 
> Any particular reason to use a short? The struct doesn't pack any better.

But it leaves a hole someone else can use.  I try not to use fields larger
than I need to.
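
The point about the hole can be illustrated with two made-up structs
(these are not the real struct file_system_type).  On common ABIs, where a
pointer is at least 4-byte aligned, a second short fits into the padding
after the first without growing the struct:

```c
#include <stddef.h>

/* Hypothetical layout: a pointer followed by a single short. */
struct with_short {
	void *ptr;
	unsigned short mc_size;
	/* tail padding here on common ABIs */
};

/* Same layout with a later addition that lands in the padding. */
struct with_short_and_more {
	void *ptr;
	unsigned short mc_size;
	unsigned short extra;	/* hypothetical later field */
};
```

Comparing sizeof() of the two shows whether the extra field was free on a
given target.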

David

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 4/9] Implement fsopen() to prepare for a mount
  2017-05-03 16:05 ` [PATCH 4/9] Implement fsopen() to prepare for a mount David Howells
                     ` (2 preceding siblings ...)
  2017-05-03 20:44   ` Rasmus Villemoes
@ 2017-05-04 10:40   ` Karel Zak
  2017-05-04 12:55   ` David Howells
                     ` (4 subsequent siblings)
  8 siblings, 0 replies; 66+ messages in thread
From: Karel Zak @ 2017-05-04 10:40 UTC (permalink / raw)
  To: David Howells; +Cc: viro, linux-fsdevel, linux-nfs, linux-kernel, mszeredi

On Wed, May 03, 2017 at 05:05:08PM +0100, David Howells wrote:
> 	mfd = fsopen("ext4", -1, O_CLOEXEC);
> 	write(mfd, "d /dev/sdb1"); // note I'm ignoring write's length arg

Not sure about 'd'; in many cases it is not a device.  For the mount(2)
syscall we call it "source".

> 	write(mfd, "o noatime");
> 	write(mfd, "o acl");
> 	write(mfd, "o user_attr");
> 	write(mfd, "o iversion");
> 	write(mfd, "o ");
> 	write(mfd, "r /my/container"); // root inside the fs
> 	fsmount(mfd, container_fd, "/mnt", AT_NO_FOLLOW);
> 
> 	mfd = fsopen("afs", -1);
> 	write(mfd, "d %grand.central.org:root.cell");
> 	write(mfd, "o cell=grand.central.org");
> 	write(mfd, "r /");
> 	fsmount(mfd, AT_FDCWD, "/mnt", 0);
> 
> If an error is reported at any step, an error message may be available to be
> read() back (ENODATA will be reported if there isn't an error available) in
> the form:
> 
> 	"e <subsys>:<problem>"
> 	"e SELinux:Mount on mountpoint not permitted"
> 
> Once fsmount() has been called, further write() calls will incur EBUSY,
> even if the fsmount() fails.  read() is still possible to retrieve error
> information.

The very basic mount(2) problem is that you have to parse
/proc/self/mountinfo to get information about the mounted filesystem.
It seems that your read() is also one-way communication.

What we really need is a way to specify *what* you want to
read. The error message is not enough; I want to know the mount options
that were finally used, the mount ID, etc. It would be nice to have something like


   fsmount(mfd, AT_FDCWD, "/mnt", 0);

   write(mfd, "o");
   read(mfd, ....);     // read mount options

   write(mfd, "i");
   read(mfd, ....);     // read mount ID


but it seems ugly. Maybe introduce another function like

    fsinfo(mfd, "o", buf, bufsz)

to get mount options (etc.) and to avoid separate write & read.


> Netlink is not used because it is optional.

 +1

    Karel
-- 
 Karel Zak  <kzak@redhat.com>
 http://karelzak.blogspot.com

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 4/9] Implement fsopen() to prepare for a mount
  2017-05-03 16:05 ` [PATCH 4/9] Implement fsopen() to prepare for a mount David Howells
                     ` (3 preceding siblings ...)
  2017-05-04 10:40   ` Karel Zak
@ 2017-05-04 12:55   ` David Howells
  2017-05-04 12:58   ` David Howells
                     ` (3 subsequent siblings)
  8 siblings, 0 replies; 66+ messages in thread
From: David Howells @ 2017-05-04 12:55 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: dhowells, viro, linux-fsdevel, linux-nfs, linux-kernel, mszeredi

Rasmus Villemoes <linux@rasmusvillemoes.dk> wrote:

> OK, mc->error must be static data, so no lifetime problems.

Technically, mc->error is only good as long as we hold a ref on its module.
It might be better to copy the string, especially as we can then do
printf-style formatting.

> But is it possible for the compiler to mess this up and reload msg from
> mc->error when it's about to do the user copy, so that if some other thread
> has managed to change mc->error (or is the error state sticky and no further
> operations allowed?) we'd copy from a string with a different length?

Yeah - I need to put a READ_ONCE() in there.
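
As a rough userspace illustration of that fix: take one snapshot of the
pointer and use only the snapshot for both the length and the copy.
READ_ONCE_SKETCH() is a simplified stand-in for the kernel macro (it
assumes GCC/Clang __typeof__), and struct ctx_sketch and read_error() are
made-up names, with memcpy() standing in for copy_to_user().

```c
#include <stddef.h>
#include <string.h>

/* Simplified stand-in: force a single load through a volatile access. */
#define READ_ONCE_SKETCH(x) (*(const volatile __typeof__(x) *)&(x))

struct ctx_sketch {
	const char *error;	/* may be changed by another thread */
};

/*
 * Copy "e <msg>" into buf.  Returns the number of bytes written,
 * -1 if there is no message, or -2 if buf is too small.
 */
static long read_error(struct ctx_sketch *ctx, char *buf, size_t len)
{
	const char *msg = READ_ONCE_SKETCH(ctx->error);	/* one snapshot */
	size_t mlen;

	if (!msg)
		return -1;
	mlen = strlen(msg);
	if (mlen + 2 > len)
		return -2;
	memcpy(buf, "e ", 2);
	memcpy(buf + 2, msg, mlen);	/* same msg that mlen was taken from */
	return (long)(mlen + 2);
}
```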

> Where does fs_name now get freed? vfs_fsopen doesn't seem to do it on
> success? (If it did, the fallthrough from err_mc: to err_fs_name: would
> be wrong.)

I'll add a free right after the vfs_fsopen() call before checking the error -
then I can get rid of err_fs_name.

David

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 4/9] Implement fsopen() to prepare for a mount
  2017-05-03 16:05 ` [PATCH 4/9] Implement fsopen() to prepare for a mount David Howells
                     ` (4 preceding siblings ...)
  2017-05-04 12:55   ` David Howells
@ 2017-05-04 12:58   ` David Howells
  2017-05-04 13:06   ` David Howells
                     ` (2 subsequent siblings)
  8 siblings, 0 replies; 66+ messages in thread
From: David Howells @ 2017-05-04 12:58 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: dhowells, viro, linux-fsdevel, linux-nfs, linux-kernel, mszeredi

Rasmus Villemoes <linux@rasmusvillemoes.dk> wrote:

> > +const struct file_operations fs_fs_fops = {
> > +	.read		= fs_fs_read,
> > +	.write		= fs_fs_write,
> > +	.release	= fs_fs_release,
> > +	.llseek		= no_llseek,
> > +};
> > +
> 
> static const struct ?

No.  It's used in the next patch to validate the fd passed to sys_fsmount():

	if (f.file->f_op != &fs_fs_fops)
		goto err_fsfd;

David

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 4/9] Implement fsopen() to prepare for a mount
  2017-05-03 16:05 ` [PATCH 4/9] Implement fsopen() to prepare for a mount David Howells
                     ` (5 preceding siblings ...)
  2017-05-04 12:58   ` David Howells
@ 2017-05-04 13:06   ` David Howells
  2017-05-04 13:34     ` Karel Zak
  2017-05-08 15:10   ` Miklos Szeredi
  2017-05-08 23:09   ` David Howells
  8 siblings, 1 reply; 66+ messages in thread
From: David Howells @ 2017-05-04 13:06 UTC (permalink / raw)
  To: Karel Zak
  Cc: dhowells, viro, linux-fsdevel, linux-nfs, linux-kernel, mszeredi

Karel Zak <kzak@redhat.com> wrote:

> > 	write(mfd, "d /dev/sdb1"); // note I'm ignoring write's length arg
> 
> Not sure about 'd', in many cases it is not device, for mount(2)
> syscall we call it "source".

sys_mount() calls it devname.  But whatever - I'm not particularly attached to
the letter 'd' for this.
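
For reference, the line format that the write handler in the patch accepts
can be sketched as a small userspace check.  parse_opt_type() is a made-up
name; the length limits and the d/o/r letters are taken from the patch as
posted and may well change.

```c
#include <stddef.h>

/*
 * Validate one "<type><space><value>" line: 3..4095 bytes, a type
 * letter of d, o or r, then a single space.  Returns the type letter,
 * or 0 if the line would be rejected.
 */
static char parse_opt_type(const char *buf, size_t len)
{
	if (len < 3 || len > 4095)
		return 0;
	if (buf[0] != 'd' && buf[0] != 'o' && buf[0] != 'r')
		return 0;
	if (buf[1] != ' ')
		return 0;
	return buf[0];
}
```

Renaming 'd' to something else would only touch the letter test.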

> The very basic mount(2) problem is that you have to parse
> /proc/self/mountinfo to get information about the mounted filesystem.
> It seems that your read() is also one way communication.
> 
> What we really need is to have a way how to specify *what* you want to
> read. The error message is not enough, I want to know the finally used
> mount options, mount ID, etc. It would be nice to have something like
> 
> 
>    fsmount(mfd, AT_FDCWD, "/mnt", 0);
> 
>    write(mfd, "o");
>    read(mfd, ....);     // read mount options
> 
>    write(mfd, "i");
>    read(mfd, ....);     // read mount ID
> 
> 
> but it seems ugly. Maybe introduce another function like 
> 
>     fsinfo(mfd, "o", buf, bufsz)
> 
> to get mount options (etc.) and to avoid separate write & read.

What is it you're trying to do?  Just read back the state of the new mount?
Or read back the state of a specified extant mount?

David

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 4/9] Implement fsopen() to prepare for a mount
  2017-05-04 13:06   ` David Howells
@ 2017-05-04 13:34     ` Karel Zak
  2017-05-09 18:40       ` Jeff Layton
  0 siblings, 1 reply; 66+ messages in thread
From: Karel Zak @ 2017-05-04 13:34 UTC (permalink / raw)
  To: David Howells; +Cc: viro, linux-fsdevel, linux-nfs, linux-kernel, mszeredi

On Thu, May 04, 2017 at 02:06:51PM +0100, David Howells wrote:
> Karel Zak <kzak@redhat.com> wrote:
> > The very basic mount(2) problem is that you have to parse
> > /proc/self/mountinfo to get information about the mounted filesystem.
> > It seems that your read() is also one way communication.
> > 
> > What we really need is to have a way how to specify *what* you want to
> > read. The error message is not enough, I want to know the finally used
> > mount options, mount ID, etc. It would be nice to have something like
> > 
> > 
> >    fsmount(mfd, AT_FDCWD, "/mnt", 0);
> > 
> >    write(mfd, "o");
> >    read(mfd, ....);     // read mount options
> > 
> >    write(mfd, "i");
> >    read(mfd, ....);     // read mount ID
> > 
> > 
> > but it seems ugly. Maybe introduce another function like 
> > 
> >     fsinfo(mfd, "o", buf, bufsz)
> > 
> > to get mount options (etc.) and to avoid separate write & read.
> 
> What is it you're trying to do?  Just read back the state of the new mount?

 ...read back the state of the new mount, because, for example, mount
 options can be modified by the FS driver. It would also be nice to have
 an API to get the state of an arbitrary mount without parsing mountinfo
 (the file is huge on some systems).

    Karel

-- 
 Karel Zak  <kzak@redhat.com>
 http://karelzak.blogspot.com

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 3/9] VFS: Introduce a mount context
  2017-05-04  9:27     ` David Howells
@ 2017-05-04 14:34       ` Joe Perches
  0 siblings, 0 replies; 66+ messages in thread
From: Joe Perches @ 2017-05-04 14:34 UTC (permalink / raw)
  To: David Howells
  Cc: Jeff Layton, viro, linux-fsdevel, linux-nfs, linux-kernel, mszeredi

On Thu, 2017-05-04 at 10:27 +0100, David Howells wrote:
> Joe Perches <joe@perches.com> wrote:
> 
> > krealloc would probably be more efficient and possibly more
> > readable, as there's likely already padding in the original
> > allocation.
> 
> Given there's a maximum of 3 slots, I think it makes better sense to just
> allocate them all up front.

Sounds good to me.
Simpler is frequently better.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC][PATCH 0/9] VFS: Introduce mount context
  2017-05-03 16:04 [RFC][PATCH 0/9] VFS: Introduce mount context David Howells
                   ` (10 preceding siblings ...)
  2017-05-03 16:50 ` David Howells
@ 2017-05-05 14:35 ` Miklos Szeredi
  2017-05-05 15:47 ` David Howells
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 66+ messages in thread
From: Miklos Szeredi @ 2017-05-05 14:35 UTC (permalink / raw)
  To: David Howells; +Cc: viro, linux-fsdevel, linux-nfs, lkml

On Wed, May 3, 2017 at 6:04 PM, David Howells <dhowells@redhat.com> wrote:
>
> Here are a set of patches to create a mount context prior to setting up a
> new mount, populating it with the parsed options/binary data and then
> effecting the mount.

Great work, thanks for taking this on.

I'd argue with some design decisions here.  One of the motivations for
doing the mount API overhaul is to create clear distinction between
separate functions like:

 - creating filesystem instance (aka superblock)

 - attaching filesystem instance into mount tree

 - reconfiguring superblock

 - changing mount properties


This patchset achieves this partly, but the separation is far from
crisp...  First of all, why is fsopen() creating a "mount
context"?  It's supposed to create a "superblock creation context".
And indeed, there are mount flags and root path in there, which are
definitely not necessary for creating a super block.

Is there a good reason why these mount specific properties leaked into
the object created by fsopen()?

Also I'd expect all context ops to be fully generic first.  I.e. no
filesystem code needs to be touched to make the new interface work.
The context would just build the option string and when everything is
ready (probably need a "commit" command) then it would go off and call
mount_fs() to create the superblock and attach it to the context.

Then, when that works, we could add context ops, so the filesystem can
do various things along the way, which is the other reason we want
this.  And in the end it would allow gradual migration to a new
superblock creation api and phasing out the old one.   But that
shouldn't be observable on either the old or the new userspace
interfaces.


> This allows namespaces and other information to be conveyed through the
> mount procedure.  It also allows extra error information to be returned
> (so many things can go wrong during a mount that a small integer isn't
> really sufficient to convey the issue).
>
> This also allows Miklós Szeredi's idea of doing:
>
>         fd = fsopen("nfs");
>         write(fd, "option=val", ...);
>         fsmount(fd, "/mnt");
>
> that he presented at LSF-2017 to be implemented (see the relevant patches
> in the series), to which I can add:
>
>         read(fd, error_buffer, ...);
>
> to read back any error message.  I didn't use netlink as that would make it
> depend on CONFIG_NET and would introduce network namespacing issues.
>
> I've implemented mount context handling for procfs and nfs.
>
> Further developments:
>
>  (*) Implement mount context support in more filesystems, ext4 being next
>      on my list.
>
>  (*) Move the walk-from-root stuff that nfs has to generic code so that you
>      can do something akin to:
>
>         mount /dev/sda1:/foo/bar /mnt
>
>      See nfs_follow_remote_path() and mount_subtree().  This is slightly
>      tricky in NFS as we have to prevent referral loops.

First we can limit this feature to non-weird (i.e. no managed dentries) subtrees.

>
>  (*) Move the pid_ns pointer from struct mount_context to struct
>      proc_mount_context as I'm not sure it's necessary for anything other
>      than procfs.
>
>  (*) Work out how to get at the error message incurred by submounts
>      encountered during nfs_follow_remote_path().
>
>      Should the error message be moved to task_struct and made more
>      general, perhaps retrieved with a prctl() function?
>
>  (*) Clean up/consolidate the security functions.  Possibly add a
>      validation hook to be called at the same time as the mount context
>      validate op.
>
> The patches can be found here also:
>
>         http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=mount-context

Will try to review the actual patches next week.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC][PATCH 0/9] VFS: Introduce mount context
  2017-05-03 16:04 [RFC][PATCH 0/9] VFS: Introduce mount context David Howells
                   ` (11 preceding siblings ...)
  2017-05-05 14:35 ` Miklos Szeredi
@ 2017-05-05 15:47 ` David Howells
  2017-05-08  8:25   ` Miklos Szeredi
  2017-05-08  8:35 ` David Howells
  2017-05-08 17:03 ` Djalal Harouni
  14 siblings, 1 reply; 66+ messages in thread
From: David Howells @ 2017-05-05 15:47 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: dhowells, viro, linux-fsdevel, linux-nfs, lkml

Miklos Szeredi <mszeredi@redhat.com> wrote:

> I'd argue with some design decisions here.  One of the motivations for
> doing the mount API overhaul is to create clear distinction between
> separate functions like:
> 
>  - creating filesystem instance (aka superblock)
> 
>  - attaching filesystem instance into mount tree
> 
>  - reconfiguring superblock
> 
>  - changing mount properties

I definitely agree that keeping a separation between vfsmount manipulation
(add, bind, move, ...) and superblock manipulation (create, remount) is a good
idea.

However, creating new superblocks and remounting superblocks have a lot in
common, including the option parsing.  Note also that existing code is
somewhat lazy about rejecting parameters that can't be changed with a remount
and will ignore some attempted changes.  We have to retain this behaviour, at
least for the normal mount() system call.

Note that one of the main reasons I'm working on this is namespace
propagation, particularly with respect to automounts.

> This patchset achieves this partly, but the separation is far from
> crisp clear...  First of all why is fsopen() creating a "mount
> context"?  It's supposed to create a "superblock creation context".

I've no particular objection to renaming struct mount_context to something
else, but it also needs to handle remount because of the commonality.

Further, once you've created a superblock, what are you going to do with it
other than mount it?  I suppose you could statfs it and we could add other
superblock manipulation functions, but this is normally done by opening the
device directly (at least for bdev-based superblocks).

> And indeed, there are mount flags and root path in there, which are
> definitely not necessary for creating a super block.

Erm, that's not strictly true.

Some filesystems (eg. nfs, ocfs2, lustre) want to know about certain MNT_xxx
flags, such as MNT_NOATIME and MNT_READONLY.

Further, the root path might be necessary for the mount - see NFS for example.
What I was thinking of, say for NFS, is splitting the source name up front,
so:

	my.nfs.org:/my/home/dir

into:

	mc->device = "my.nfs.org";
	mc->root_path = "/my/home/dir";

and then having the VFS handle the root walk rather than doing it inside NFS.
This facility could then become available to other filesystems potentially.
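
As a rough userspace illustration of that split (split_source() is a made-up
helper, not something from the patches; it assumes a plain "host:/path" source
and makes no attempt to cope with IPv6 literals or a missing path):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of the patch set: split "host:/path" at the
 * first ':' into a device part and a root path.  Returns 0 on success, -1 if
 * there is no ':' separator or an output buffer is too small.
 */
int split_source(const char *src,
		 char *device, size_t device_len,
		 char *root_path, size_t root_len)
{
	const char *colon;
	size_t host_len, path_len;

	assert(src && device && root_path);

	colon = strchr(src, ':');
	if (!colon)
		return -1;

	host_len = (size_t)(colon - src);
	path_len = strlen(colon + 1);
	if (host_len + 1 > device_len || path_len + 1 > root_len)
		return -1;

	memcpy(device, src, host_len);
	device[host_len] = '\0';
	memcpy(root_path, colon + 1, path_len + 1);	/* copies the NUL too */
	return 0;
}
```

The two output buffers stand in for mc->device and mc->root_path; in the
kernel these would presumably be allocated strings rather than fixed buffers.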

However, in the case of NFS, you may need to hand the root path off to a
mount server.

> Is there a good reason why these mount specific properties leaked into
> the object created by fsopen()?

Answered above.  I'm okay with removing root_path from the context for
the moment.  It's something that can be revisited later.

We also might need to remove usage of MNT_xxx flags from filesystems.

> Also I'd expect all context ops to be fully generic first.  I.e. no
> filesystem code needs to be touched to make the new interface work.
> The context would just build the option string and when everything is
> ready (probably need a "commit" command) then it would go off and call
> mount_fs() to create the superblock and attach it to the context.

That should be easy enough to add as a fallback.

> Then, when that works, we could add context ops, so the filesystem can
> do various things along the way, which is the other reason we want
> this.  And in the end it would allow gradual migration to a new
> superblock creation api and phasing out the old one.

I'm not sure the context ops are so easy to add gradually.

> But that shouldn't be observable on either the old or the new userspace
> interfaces.

Almost a fair point - but it can be observed by pushing in more than a page's
worth of options.  What I have now for NFS will still work with
fsopen()/write()/fsmount() whereas mount() won't.
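
To put a number on "a page's worth", here is a small userspace sketch
(options_fit_in_page() is a hypothetical helper, and the page size is passed in
rather than assumed) that checks whether a set of options, joined with commas
as they would be for mount(), still fits in one page:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: would these options, joined with ',' and NUL
 * terminated, fit in a single data page of page_size bytes?
 * Returns 1 if they fit, 0 if not.
 */
int options_fit_in_page(const char *const *opts, size_t nr_opts,
			size_t page_size)
{
	size_t total = 1;	/* trailing NUL */
	size_t i;

	assert(opts || nr_opts == 0);

	for (i = 0; i < nr_opts; i++) {
		total += strlen(opts[i]);
		if (i > 0)
			total++;	/* ',' separator */
	}
	return total <= page_size;
}
```

With options written one at a time to the fd, no such aggregate limit applies
to the set as a whole, which is where the observable difference comes from.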

David


* Re: [RFC][PATCH 0/9] VFS: Introduce mount context
  2017-05-05 15:47 ` David Howells
@ 2017-05-08  8:25   ` Miklos Szeredi
  0 siblings, 0 replies; 66+ messages in thread
From: Miklos Szeredi @ 2017-05-08  8:25 UTC (permalink / raw)
  To: David Howells; +Cc: viro, linux-fsdevel, linux-nfs, lkml

On Fri, May 5, 2017 at 5:47 PM, David Howells <dhowells@redhat.com> wrote:
> Miklos Szeredi <mszeredi@redhat.com> wrote:
>
>> I'd argue with some design decisions here.  One of the motivations for
>> doing the mount API overhaul is to create clear distinction between
>> separate functions like:
>>
>>  - creating filesystem instance (aka superblock)
>>
>>  - attaching filesystem instance into mount tree
>>
>>  - reconfiguring superblock
>>
>>  - changing mount properties
>
> I definitely agree that keeping a separation between vfsmount manipulation
> (add, bind, move, ...) and superblock manipulation (create, remount) is a good
> idea.
>
> However, creating new superblocks and remounting superblocks have a lot in
> common, including the option parsing.  Note also that existing code is
> somewhat lazy about rejecting parameters that can't be changed with a remount
> and will ignore some attempted changes.  We have to retain this behaviour, at
> least for the normal mount() system call.
>
> Note that one of the main reasons I'm working on this is namespace
> propagation, particularly with respect to automounts.
>
>> This patchset achieves this partly, but the separation is far from
>> crisp clear...  First of all why is fsopen() creating a "mount
>> context"?  It's supposed to create a "superblock creation context".
>
> I've no particular objection to renaming struct mount_context to something
> else, but it also needs to handle remount because of the commonality.

Definitely agree about having the same object handle filesystem
creation and filesystem reconfiguration.  And yeah, I think naming
does have to be changed, simply because the old way of doing things is
so ingrained in everything in this area.

> Further, once you've created a superblock, what are you going to do with it
> other than mount it?  I suppose you could statfs it and we could add other
> superblock manipulation functions, but this is normally done by opening the
> device directly (at least for bdev-based superblocks).

It surely makes sense to mount it, but that does not mean we need to
blur the boundary.

>> And indeed, there are mount flags and root path in there, which are
>> definitely not necessary for creating a super block.
>
> Erm, that's not strictly true.
>
> Some filesystems (eg. nfs, ocfs2, lustre) want to know about certain MNT_xxx
> flags, such as MNT_NOATIME and MNT_READONLY.

So sometimes filesystems have access to mount flags that tell about
the mount that the operation was done through (the only relevant op is
getattr(), right?).

But that doesn't mean the filesystem has any business looking at the
mount flags that the initial mount will be created with (it definitely
does not).

> Further, the root path might be necessary for the mount - see NFS for example.
> What I was thinking of, say for NFS, is splitting the source name up front,
> so:
>
>         my.nfs.org:/my/home/dir
>
> into:
>
>         mc->device = "my.nfs.org";
>         mc->root_path = "/my/home/dir";
>
> and then having the VFS handle the root walk rather than doing it inside NFS.
> This facility could then become available to other filesystems potentially.

Ah, I see what you are saying:  we don't have the infrastructure
currently in the VFS to handle NFS subdir mounts, but want to
introduce this in the interface now (because NFS *can* handle it), so
that when we will have the VFS infrastructure we can silently switch
to it.

> However, in the case of NFS, you may need to hand the root path off to a
> mount server.

Hmm... IMO the right way to do this would be to move the NFS subdir
code first to the VFS and then introduce the new interface with the
correct semantics (i.e. subdir is handled on mount not on superblock
creation).

I haven't looked at what NFS is doing, but subdir mounts should be
just like plain bind mounts, no?  The special thing here is that we
need to do that bind mount before the filesystem is initially mounted.
But that can be done by

 - first doing a kernel-internal mount
 - doing the pathwalk on that (which can trigger more internal mounts)
 - attaching the found source to the target mountpoint
 - finally discarding the kernel internal mount tree

>> Is there a good reason why these mount specific properties leaked into
>> the object created by fsopen()?
>
> Answered above.  I'm okay with removing root_path from the context for
> the moment.  It's something that can be revisited later.
>
> We also might need to remove usage of MNT_xxx flags from filesystems.

There is no MNT_xxx flag usage in superblock creation, because
mnt_flags are not passed to filesystems and the relevant MS_xxx ones
are cleared from the flags that *are* passed to the fs.

MS_RDONLY is special, because it's a mount flag as well as a
superblock flag.  No problem there, with the new interface it will be
possible to set them separately (e.g. ro on sb and rw on mnt or vice
versa).
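
To illustrate the split (a userspace sketch only: the X_MS_* names and
split_ms_flags() are made up for illustration, the values are copied from the
UAPI MS_* constants, and the mask covers just a handful of the per-mount
flags):

```c
#include <assert.h>

/*
 * Values mirror the Linux UAPI MS_* constants; redefined under made-up
 * names so the sketch builds without <sys/mount.h>.
 */
#define X_MS_RDONLY		1UL
#define X_MS_NOSUID		2UL
#define X_MS_NODEV		4UL
#define X_MS_NOEXEC		8UL
#define X_MS_SYNCHRONOUS	16UL
#define X_MS_NOATIME		1024UL
#define X_MS_NODIRATIME		2048UL

/* Flags that only describe the mount, never the superblock. */
#define X_PER_MOUNT_MASK \
	(X_MS_NOSUID | X_MS_NODEV | X_MS_NOEXEC | \
	 X_MS_NOATIME | X_MS_NODIRATIME)

struct split_flags {
	unsigned long mnt;	/* bits that apply to the mount */
	unsigned long sb;	/* bits that reach the filesystem */
};

/* Hypothetical helper: separate one MS_* word into the two domains. */
struct split_flags split_ms_flags(unsigned long ms_flags)
{
	struct split_flags out;

	/* Read-only is the one bit that must not be in the per-mount mask. */
	assert((X_PER_MOUNT_MASK & X_MS_RDONLY) == 0);

	/* Per-mount bits, plus read-only, describe the mount. */
	out.mnt = ms_flags & (X_PER_MOUNT_MASK | X_MS_RDONLY);
	/* Per-mount bits are cleared for the fs; read-only is kept. */
	out.sb = ms_flags & ~X_PER_MOUNT_MASK;
	return out;
}
```

Both halves are still expressed as MS_* bits here; translating the mount half
to MNT_* flags is left out.  Read-only is the only bit that shows up on both
sides, which is exactly the case a new interface could set separately.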

>> But that shouldn't be observable on either the old or the new userspace
>> interfaces.
>
> Almost a fair point - but it can be observed by pushing in more than a page's
> worth of options.  What I have now for NFS will still work with
> fsopen()/write()/fsmount() whereas mount() won't.

Alright, let's just say that everything that works now should work
the same way on the old as well as the new interfaces.

Thanks,
Miklos


* Re: [RFC][PATCH 0/9] VFS: Introduce mount context
  2017-05-03 16:04 [RFC][PATCH 0/9] VFS: Introduce mount context David Howells
                   ` (12 preceding siblings ...)
  2017-05-05 15:47 ` David Howells
@ 2017-05-08  8:35 ` David Howells
  2017-05-08  8:43   ` Miklos Szeredi
  2017-05-08 17:03 ` Djalal Harouni
  14 siblings, 1 reply; 66+ messages in thread
From: David Howells @ 2017-05-08  8:35 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: dhowells, viro, linux-fsdevel, linux-nfs, lkml

David Howells <dhowells@redhat.com> wrote:

> > This patchset achieves this partly, but the separation is far from
> > crisp clear...  First of all why is fsopen() creating a "mount
> > context"?  It's supposed to create a "superblock creation context".
> 
> I've no particular objection to renaming struct mount_context to something
> else, but it also needs to handle remount because of the commonality.
> 
> Further, once you've created a superblock, what are you going to do with it
> other than mount it?  I suppose you could statfs it and we could add other
> superblock manipulation functions, but this is normally done by opening the
> device directly (at least for bdev-based superblocks).

How about sb_context, sb_config, sb_parameters or something like that?

David


* Re: [RFC][PATCH 0/9] VFS: Introduce mount context
  2017-05-08  8:35 ` David Howells
@ 2017-05-08  8:43   ` Miklos Szeredi
  0 siblings, 0 replies; 66+ messages in thread
From: Miklos Szeredi @ 2017-05-08  8:43 UTC (permalink / raw)
  To: David Howells; +Cc: viro, linux-fsdevel, linux-nfs, lkml

On Mon, May 8, 2017 at 10:35 AM, David Howells <dhowells@redhat.com> wrote:

>> Further, once you've created a superblock, what are you going to do with it
>> other than mount it?  I suppose you could statfs it and we could add other
>> superblock manipulation functions, but this is normally done by opening the
>> device directly (at least for bdev-based superblocks).
>
> How about sb_context, sb_config, sb_parameters or something like that?

I'd vote for sb_config.

Thanks,
Miklos


* Re: [PATCH 3/9] VFS: Introduce a mount context
  2017-05-03 16:04 ` [PATCH 3/9] VFS: Introduce a mount context David Howells
                     ` (2 preceding siblings ...)
  2017-05-04 10:22   ` David Howells
@ 2017-05-08 15:05   ` Miklos Szeredi
  2017-05-08 22:57   ` David Howells
  4 siblings, 0 replies; 66+ messages in thread
From: Miklos Szeredi @ 2017-05-08 15:05 UTC (permalink / raw)
  To: David Howells; +Cc: viro, linux-fsdevel, linux-nfs, lkml

On Wed, May 3, 2017 at 6:04 PM, David Howells <dhowells@redhat.com> wrote:
> Introduce a mount context concept.  This is allocated at the beginning of
> the mount procedure and into it is placed:
>
>  (1) Filesystem type.
>
>  (2) Namespaces.
>
>  (3) Device name.
>
>  (4) Superblock flags (MS_*) and mount flags (MNT_*).
>
>  (5) Security details.
>
>  (6) Filesystem-specific data, as set by the mount options.
>
> It also gives a place in which to hang an error message for later retrieval
> (see the mount-by-fd syscall later in this series).
>
> Rather than calling fs_type->mount(), a mount_context struct is created and
> fs_type->fsopen() is called to set it up.  fs_type->mc_size says how much
> should be added on to the mount context for the filesystem's use.
>
> A set of operations have to be set by ->fsopen() to provide freeing,
> duplication, option parsing, binary data parsing, validation, mounting and
> superblock filling.
>
> It should be noted that, whilst this patch adds a lot of lines of code,
> there is quite a bit of duplication with existing code that can be
> eliminated should all filesystems be converted over.
>
> Signed-off-by: David Howells <dhowells@redhat.com>
> ---
>
>  Documentation/filesystems/mounting.txt |  445 ++++++++++++++++++++++++++++++++
>  fs/Makefile                            |    3
>  fs/internal.h                          |    2
>  fs/mount.h                             |    3
>  fs/mount_context.c                     |  343 +++++++++++++++++++++++++
>  fs/namespace.c                         |  270 +++++++++++++++++--
>  fs/super.c                             |   50 +++-
>  include/linux/fs.h                     |   11 +
>  include/linux/lsm_hooks.h              |   37 +++
>  include/linux/mount.h                  |   67 +++++
>  include/linux/security.h               |   29 ++
>  security/security.c                    |   32 ++
>  security/selinux/hooks.c               |  179 +++++++++++++
>  13 files changed, 1435 insertions(+), 36 deletions(-)
>  create mode 100644 Documentation/filesystems/mounting.txt
>  create mode 100644 fs/mount_context.c
>
> diff --git a/Documentation/filesystems/mounting.txt b/Documentation/filesystems/mounting.txt
> new file mode 100644
> index 000000000000..a942ccd08376
> --- /dev/null
> +++ b/Documentation/filesystems/mounting.txt
> @@ -0,0 +1,445 @@
> +                             ===================
> +                             FILESYSTEM MOUNTING
> +                             ===================
> +
> +CONTENTS
> +
> + (1) Overview.
> +
> + (2) The mount context.
> +
> + (3) The mount context operations.
> +
> + (4) Mount context security.
> +
> + (5) VFS mount context operations.
> +
> +
> +========
> +OVERVIEW
> +========
> +
> +The creation of new mounts is now to be done in a multistep process:
> +
> + (1) Create a mount context.
> +
> + (2) Parse the options and attach them to the mount context.  Options may be
> +     passed individually from userspace.
> +
> + (3) Validate and pre-process the mount context.

(3.5) Create super block

I think this needs to be triggered by something like a "commit" command
from userspace.  Basically this is where the options are atomically
set on the new (create) or existing (reconfigure) superblock.

> +
> + (4) Perform the mount.
> +
> + (5) Return an error message attached to the mount context.

Swap the order of the above.  There's no fs specific actions performed
at fsmount() time, and normal errno reporting should be perfectly
fine.

> +
> + (6) Destroy the mount context.
> +
> +To support this, the file_system_type struct gains two new fields:
> +
> +       unsigned short mc_size;
> +
> +which indicates how much space the filesystem would like tacked onto the end of
> +the mount_context struct for its own purposes, and:
> +
> +       int (*fsopen)(struct mount_context *mc, struct super_block *src_sb);
> +
> +which is invoked to set up the filesystem-specific parts of a mount context,
> +including the additional space.  The src_sb parameter is used to convey the
> +superblock from which the filesystem may draw extra information (such as
> +namespaces), for submount (MS_SUBMOUNT) or remount (MS_REMOUNT) purposes or it
> +will be NULL.

I think reconfigure (don't call it remount, there's no "mounting"
going on there) should start out with a context populated with
the current state of the superblock.  User can then reset and start
over or individually add/remove options.   This should be a good place
to allow querying the options as well, as Karel suggested.  Then when
the configuration is finished the changes are committed to the
superblock.

> +
> +Note that security initialisation is done *after* the filesystem is called so
> +that the namespaces may be adjusted first.
> +
> +And the super_operations struct gains one:
> +
> +       int (*remount_fs_mc) (struct super_block *, struct mount_context *);
> +
> +This shadows the ->remount_fs() operation and takes a prepared mount context
> +instead of the mount flags and data page.  It may modify the ms_flags in the
> +context for the caller to pick up.
> +
> +[NOTE] remount_fs_mc is intended as a replacement for remount_fs.
> +
> +
> +=================
> +THE MOUNT CONTEXT
> +=================
> +
> +The mount process is governed by a mount context.  This is represented by the
> +mount_context structure:
> +
> +       struct mount_context {
> +               const struct mount_context_operations *ops;
> +               struct file_system_type *fs;
> +               struct user_namespace   *user_ns;
> +               struct mnt_namespace    *mnt_ns;
> +               struct pid_namespace    *pid_ns;
> +               struct net              *net_ns;
> +               const struct cred       *cred;
> +               char                    *device;
> +               char                    *root_path;
> +               void                    *security;
> +               const char              *error;
> +               unsigned int            ms_flags;
> +               unsigned int            mnt_flags;
> +               bool                    mounted;
> +               bool                    sloppy;
> +               bool                    silent;
> +               enum mount_type         mount_type : 8;
> +       };
> +
> +When allocated, the mount_context struct is extended by ->mc_size bytes as
> +specified by the specified file_system_type struct.  This is for use by the
> +filesystem.  The filesystem should wrap the struct in its own, e.g.:
> +
> +       struct nfs_mount_context {
> +               struct mount_context mc;
> +               ...
> +       };
> +
> +placing the mount_context struct first.  container_of() can then be used.
> +
> +The mount_context fields are as follows:
> +
> + (*) const struct mount_context_operations *ops
> +
> +     These are operations that can be done on a mount context.  See below.
> +     This must be set by the ->fsopen() file_system_type operation.
> +
> + (*) struct file_system_type *fs
> +
> +     A pointer to the file_system_type of the filesystem that is being
> +     mounted.  This retains a ref on the type owner.
> +
> + (*) struct user_namespace *user_ns
> + (*) struct mnt_namespace *mnt_ns
> + (*) struct pid_namespace *pid_ns
> + (*) struct net *net_ns
> +
> +     This is a subset of the namespaces in use by the invoking process.  This
> +     retains a ref on each namespace.  The subscribed namespaces may be
> +     replaced by the filesystem to reflect other sources, such as the parent
> +     mount superblock on an automount.
> +
> + (*) struct cred *cred
> +
> +     The mounter's credentials.  This retains a ref on the credentials.
> +
> + (*) char *device
> +
> +     This is the device to be mounted.  It may be a block device
> +     (e.g. /dev/sda1) or something more exotic, such as the "host:/path" that
> +     NFS desires.
> +
> + (*) char *root_path
> +
> +     A path to the place inside the filesystem to actually mount.  This allows
> +     a mount and bind-mount to be combined.
> +
> +     [NOTE] This isn't implemented yet, but NFS has the code to do this which
> +     could be moved to the VFS.
> +
> + (*) void *security
> +
> +     A place for the LSMs to hang their security data for the mount.  The
> +     relevant security operations are described below.
> +
> + (*) const char *error
> +
> +     A place for the VFS and the filesystem to hang an error message.  This
> +     should be in the form of a static string that doesn't need deallocation
> +     and the pointer to which can just be overwritten.  Under some
> +     circumstances, this can be retrieved by userspace.
> +
> +     Note that the existence of the error string is expected to be guaranteed
> +     by the reference on the file_system_type object held by ->fs or any
> +     filesystem-specific reference held in the filesystem context until the
> +     ->free() operation is called.
> +
> + (*) unsigned int ms_flags
> + (*) unsigned int mnt_flags
> +
> +     These hold the mount flags.  ms_flags holds MS_* flags and mnt_flags holds
> +     MNT_* flags.
> +
> + (*) bool mounted
> +
> +     This is set to true once a mount attempt is made.  This causes an error to
> +     be given on subsequent mount attempts with the same context and prevents
> +     multiple mount attempts.

No point.  A context is mountable if the superblock is non-NULL.
Don't even need to have the context committed, if not, it would simply
mount the sb in the previous state.

I'd hope some simplifications would fall out from this model.

Thanks,
Miklos


* Re: [PATCH 4/9] Implement fsopen() to prepare for a mount
  2017-05-03 16:05 ` [PATCH 4/9] Implement fsopen() to prepare for a mount David Howells
                     ` (6 preceding siblings ...)
  2017-05-04 13:06   ` David Howells
@ 2017-05-08 15:10   ` Miklos Szeredi
  2017-05-08 23:09   ` David Howells
  8 siblings, 0 replies; 66+ messages in thread
From: Miklos Szeredi @ 2017-05-08 15:10 UTC (permalink / raw)
  To: David Howells; +Cc: viro, linux-fsdevel, linux-nfs, lkml

On Wed, May 3, 2017 at 6:05 PM, David Howells <dhowells@redhat.com> wrote:
> Provide an fsopen() system call that starts the process of preparing to
> mount, using an fd as a context handle.  fsopen() is given the name of the
> filesystem that will be used:
>
>         int mfd = fsopen(const char *fsname, int reserved,
>                          int open_flags);
>
> where reserved should be -1 for the moment (it will be used to pass the
> namespace information in future) and open_flags can be 0 or O_CLOEXEC.

Someone also suggested using /dev/fs/${FSTYPE} to open the fsfd.  I
realize that does not have the namespace info that you also want to
add, but wondering if that really has to come from open and cannot be
set later?

Alternatives are /proc/fs/${FSTYPE}/dev or /sys/fs/${FSTYPE}/dev.

Obviously neither can be used for bootstrapping but there's still old
mount(2) for that.

I haven't convinced myself whether using plain open(2) or a new
fsopen(2) syscall is better, just mentioning that this is a
possibility as well.

Thanks,
Miklos


* Re: [RFC][PATCH 0/9] VFS: Introduce mount context
  2017-05-03 16:04 [RFC][PATCH 0/9] VFS: Introduce mount context David Howells
                   ` (13 preceding siblings ...)
  2017-05-08  8:35 ` David Howells
@ 2017-05-08 17:03 ` Djalal Harouni
  14 siblings, 0 replies; 66+ messages in thread
From: Djalal Harouni @ 2017-05-08 17:03 UTC (permalink / raw)
  To: David Howells
  Cc: Alexander Viro, Linux FS Devel, linux-nfs, linux-kernel, Miklos Szeredi

On Wed, May 3, 2017 at 6:04 PM, David Howells <dhowells@redhat.com> wrote:
>
> Here are a set of patches to create a mount context prior to setting up a
> new mount, populating it with the parsed options/binary data and then
> effecting the mount.
>
> This allows namespaces and other information to be conveyed through the
> mount procedure.  It also allows extra error information to be returned
> (so many things can go wrong during a mount that a small integer isn't
> really sufficient to convey the issue).
>
> This also allows Miklós Szeredi's idea of doing:
>
>         fd = fsopen("nfs");
>         write(fd, "option=val", ...);
>         fsmount(fd, "/mnt");


This may help to clarify the boundary between what you can do with a
vfsmount (bind) and with the filesystem.  Containers, orchestration
tools, etc. treat bind mounts dynamically: developers and users on
GitHub assume that they can dynamically add/move mounts between
namespaces, but this won't work with userns, so maybe this will help...
My other suggestion: clear documentation and code comments will really
help!  A year ago I posted and used some patches for UID shifting
within the VFS layer, and it seems that they really need something like
this!

I'm not sure where I read about netlink, but at the very least it should
take userspace capabilities and namespace privacy/context into account...

> that he presented at LSF-2017 to be implemented (see the relevant patches
> in the series), to which I can add:
>
>         read(fd, error_buffer, ...);
>
> to read back any error message.  I didn't use netlink as that would make it
> depend on CONFIG_NET and would introduce network namespacing issues.
>
> I've implemented mount context handling for procfs and nfs.
>
> Further developments:
>
>  (*) Implement mount context support in more filesystems, ext4 being next
>      on my list.
>
>  (*) Move the walk-from-root stuff that nfs has to generic code so that you
>      can do something akin to:
>
>         mount /dev/sda1:/foo/bar /mnt
>
>      See nfs_follow_remote_path() and mount_subtree().  This is slightly
>      tricky in NFS as we have to prevent referral loops.
>
>  (*) Move the pid_ns pointer from struct mount_context to struct
>      proc_mount_context as I'm not sure it's necessary for anything other
>      than procfs.

FWIW the RFC "proc: support private proc instances per pidnamespace"
[1] that I have to clean up will hide pid_ns under the procfs filesystem,
so maybe that's a good reason to move it and then get rid of it.

Thanks!


[1] https://lkml.org/lkml/2017/4/25/282

-- 
tixxdz


* Re: [PATCH 3/9] VFS: Introduce a mount context
  2017-05-03 16:04 ` [PATCH 3/9] VFS: Introduce a mount context David Howells
                     ` (3 preceding siblings ...)
  2017-05-08 15:05   ` Miklos Szeredi
@ 2017-05-08 22:57   ` David Howells
  2017-05-09  8:03     ` Miklos Szeredi
                       ` (3 more replies)
  4 siblings, 4 replies; 66+ messages in thread
From: David Howells @ 2017-05-08 22:57 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: dhowells, viro, linux-fsdevel, linux-nfs, lkml

Miklos Szeredi <mszeredi@redhat.com> wrote:

> > + (3) Validate and pre-process the mount context.
> 
> (3.5) Create super block
> 
> I think this needs to be triggered by something like a "commit" command
> from userspace.  Basically this is where the options are atomically
> set on the new (create) or existing (reconfigure) superblock.

Why do you need to expose this step to userspace?  Assuming in the "new" case
you do, say:

	fd = fsopen("nfs");
	write(fd, "s foo.bar:/bar", ...);
	write(fd, "o intr", ...);
	write(fd, "o fsc", ...);
	...
	write(fd, "c", ...); /* commit operation to get a superblock */
	fsmount(fd, AT_FDCWD, "/mnt");  /* mount the superblock we just got */

Then the "commit" op is dissimilar to "mount -o remount" since remount may
alter the superblock parameters *and* the mountpoint parameters, but commit
can only affect the superblock.
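
To make the shape of those writes concrete, here is a userspace sketch of how
such one-letter commands might be decoded.  The 's', 'o' and 'c' prefixes come
from the example above; parse_fs_cmd() and the fs_cmd types are hypothetical
and say nothing about how the kernel side would really do it:

```c
#include <assert.h>
#include <stddef.h>

enum fs_cmd_type {
	FS_CMD_SOURCE,		/* "s <source>" */
	FS_CMD_OPTION,		/* "o <option>" */
	FS_CMD_COMMIT,		/* "c" */
	FS_CMD_INVALID,
};

struct fs_cmd {
	enum fs_cmd_type type;
	const char *arg;	/* points into the caller's buffer, or NULL */
};

/* Hypothetical decoder for one NUL-terminated command string. */
struct fs_cmd parse_fs_cmd(const char *buf)
{
	struct fs_cmd cmd = { FS_CMD_INVALID, NULL };

	assert(buf);

	switch (buf[0]) {
	case 's':
	case 'o':
		/* Require "<letter><space><non-empty argument>". */
		if (buf[1] != ' ' || buf[2] == '\0')
			return cmd;
		cmd.type = buf[0] == 's' ? FS_CMD_SOURCE : FS_CMD_OPTION;
		cmd.arg = buf + 2;
		return cmd;
	case 'c':
		/* Commit takes no argument. */
		if (buf[1] == '\0')
			cmd.type = FS_CMD_COMMIT;
		return cmd;
	default:
		return cmd;
	}
}
```

Note that nothing in this command set touches the mountpoint, which is the
point being made: a commit of this kind can only ever act on the superblock.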

On the other hand, I could see that you might want to do:

	fd = fsopen("nfs");
	...
	write(fd, "c", ...); /* commit operation to get a superblock */
	fstatfs(fd, &buf); /* get info about the superblock */
	fsmount(fd, AT_FDCWD, "/mnt");  /* mount the superblock we just got */

> > + (4) Perform the mount.
> > +
> > + (5) Return an error message attached to the mount context.
> 
> Swap the order of the above.  There's no fs specific actions performed
> at fsmount() time, and normal errno reporting should be perfectly
> fine.

There's no reason not to allow error messages to be attached by the actual
vfsmount creation and insertion - and reasons that one might want to do so.
Think LSMs, for instance.  We don't look up the mountpoint until this point,
and so we can't do the security checks on them till this point.  It could make
it easier to debug problems if we can return a more comprehensive message at
this point.

> I think reconfigure (don't call it remount, there's no "mounting"
> going on there)

There's adjustment of the vfsmount structure too; besides, it is called
MS_REMOUNT in the UAPI and "mount -o remount", so we're somewhat stuck with
the label whether we like it or not.

> should start out with a context populated with the current state of the
> superblock.

Hence why ->fsopen() takes a super_block parameter.

> User can then reset and start over

No, not really.  You cannot reset all options - the source for example,
probably has to remain the same.  IP addresses on NFS mounts possibly should
remain the same - though I can see situations where it might be convenient to
change these.

> or individually add/remove options.

This is very per-filesystem-type dependent.

> This should be a good place to allow querying the options as well, as Karel
> suggested.

I'm not sure it's worth the code unless we allow opening extant mounts and
querying using this mechanism.

> Then when the configuration is finished the changes are committed to the
> superblock.

You're going a lot beyond remount here.  Remount can, in one go, change some
options which are superblock-only, some options which are mountpoint-only and
at least one which crosses both domains.

> > + (*) bool mounted
> > +
> > +     This is set to true once a mount attempt is made.  This causes an error to
> > +     be given on subsequent mount attempts with the same context and prevents
> > +     multiple mount attempts.
> 
> No point.  A context is mountable if the superblock is non-NULL.
> Don't even need to have the context committed,

Ummm...  Doesn't that render "commit" unnecessary?

> if not, it would simply mount the sb in the previous state.

You want to be able to open a filesystem fd, create or reference a superblock
and then mount it several times?

> I'd hope some simplifications would fall out from this model.

Not really.  It makes things slightly less simple, particularly with the
"commit" operation that you want.  I'm not sure that sys_mount() and
sys_fsmount() will be able to share as much code.

It also makes the remount process less similar to the mount process because
the "commit" operation doesn't seem useful in the former because remount also
alters the vfsmount.

David


* Re: [PATCH 4/9] Implement fsopen() to prepare for a mount
  2017-05-03 16:05 ` [PATCH 4/9] Implement fsopen() to prepare for a mount David Howells
                     ` (7 preceding siblings ...)
  2017-05-08 15:10   ` Miklos Szeredi
@ 2017-05-08 23:09   ` David Howells
  8 siblings, 0 replies; 66+ messages in thread
From: David Howells @ 2017-05-08 23:09 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: dhowells, viro, linux-fsdevel, linux-nfs, lkml

Miklos Szeredi <mszeredi@redhat.com> wrote:

> Someone also suggested using /dev/fs/${FSTYPE} to open the fsfd.

The downside of using open() for this is that you then have a chicken-and-egg
problem with respect to booting as you point out.

> I realize that does not have the namespace info that you also want to add,
> but wondering if that really has to come from open and cannot be set later?

When do you do the security checks?  Those are going to be affected by the
namespaces.  Other things are as well, such as setting hostnames, IP
addresses, device file paths and default UIDs/GIDs, but these are probably
more okay with being deferred to the parameter validation step.

> Alternatives are /proc/fs/${FSTYPE}/dev or /sys/fs/${FSTYPE}/dev.
> 
> Obviously neither can be used for bootstrapping but there's still old
> mount(2) for that.

It should also be possible to build-time disable mount(2) in future.
Obviously, this would mean providing other vectors for the other functions of
mount(2).

David

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 3/9] VFS: Introduce a mount context
  2017-05-08 22:57   ` David Howells
@ 2017-05-09  8:03     ` Miklos Szeredi
  2017-05-10 12:41       ` Karel Zak
  2017-05-09  9:32     ` David Howells
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 66+ messages in thread
From: Miklos Szeredi @ 2017-05-09  8:03 UTC (permalink / raw)
  To: David Howells; +Cc: viro, linux-fsdevel, linux-nfs, lkml

On Tue, May 9, 2017 at 12:57 AM, David Howells <dhowells@redhat.com> wrote:
> Miklos Szeredi <mszeredi@redhat.com> wrote:
>
>> > + (3) Validate and pre-process the mount context.
>>
>> (3.5) Create super block
>>
>> I think this needs to be triggered by something like a "commit" command
>> from userspace.  Basically this is where the options are atomically
>> set on the new (create) or existing (reconfigure) superblock.
>
> Why do you need to expose this step to userspace?  Assuming in the "new" case
> you do, say:
>
>         fd = fsopen("nfs");
>         write(fd, "s foo.bar:/bar", ...);
>         write(fd, "o intr", ...);
>         write(fd, "o fsc", ...);
>         ...
>         write(fd, "c", ...); /* commit operation to get a superblock */
>         fsmount(fd, AT_FDCWD, "/mnt");  /* mount the superblock we just got */
>
> Then the "commit" op is dissimilar to "mount -o remount" since remount may
> alter the superblock parameters *and* the mountpoint parameters, but commit
> can only affect the superblock.

Forget remount, it's a historical remnant.  We need fsreconfig(sb) and
setmntattr(mnt).  They are changing properties of different objects.
Remount is like fcntl(fd, F_SETFL) and fchmod(fd, ...) rolled into
one.   They have nothing in common except the fact that the old
mount(2) API included both in one single operation, and I'm sure that
was an "oh we don't want to introduce a new flag for this, so let's
reuse the old one" sort of design decision.
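
The analogy above can be made concrete with a minimal userspace sketch.
It only uses existing calls (fcntl(2) and fchmod(2)); the helper names
set_append() and set_mode() are made up for illustration, and mapping
them onto setmntattr()/fsreconfig() is one reading of the analogy, not
an existing API:

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Changes a property of one open file description only.
 * Loosely comparable to the proposed setmntattr(mnt).
 */
int set_append(int fd)
{
	int flags = fcntl(fd, F_GETFL);

	if (flags < 0)
		return -1;
	return fcntl(fd, F_SETFL, flags | O_APPEND);
}

/*
 * Changes a property of the underlying inode, shared by every
 * descriptor that refers to it.  Loosely comparable to the proposed
 * fsreconfig(sb).
 */
int set_mode(int fd, mode_t mode)
{
	return fchmod(fd, mode);
}
```

With two descriptors opened separately on the same file, set_append()
on one leaves the other's flags untouched, while set_mode() through
either one is visible through both, much as a superblock change is
visible through every mount of that superblock.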

>
> On the other hand, I could see that you might want to do:
>
>         fd = fsopen("nfs");
>         ...
>         write(fd, "c", ...); /* commit operation to get a superblock */
>         fstatfs(fd, &buf); /* get info about the superblock */
>         fsmount(fd, AT_FDCWD, "/mnt");  /* mount the superblock we just got */
>
>> > + (4) Perform the mount.
>> > +
>> > + (5) Return an error message attached to the mount context.
>>
>> Swap the order of the above.  There's no fs specific actions performed
>> at fsmount() time, and normal errno reporting should be perfectly
>> fine.
>
> There's no reason not to allow error messages to be attached by the actual
> vfsmount creation and insertion - and reasons that one might want to do so.
> Think LSMs, for instance.  We don't look up the mountpoint until this point,
> and so we can't do the security checks on them till this point.  It could make
> it easier to debug problems if we can return a more comprehensive message at
> this point.

I think that's crazy.  We don't return detailed errors for any other
syscall for path lookup, so why would path lookup for mount be
special.

And why would

    fd = open("/foo/bar", O_PATH);
    fsmount(fsfd, fd, NULL);

behave differently from

    fsmount(fsfd, -1, "/foo/bar");

?

>
>> I think reconfigure (don't call it remount, there's no "mounting"
>> going on there)
>
> There's adjustment of the vfsmount structure too; besides, it is called
> MS_REMOUNT in the UAPI and "mount -o remount", so we're somewhat stuck with
> the label whether we like it or not.

Oh, uapi compatibility: they can use the old mount(2) API for that and
introduce saner utils for the new stuff.  I'm sure we don't need to be
100% feature compatible with old  mount(2).

What we need is mount(2) to stay 100% compatible with itself while the
kernel internal APIs are reshuffled.


>> should start out with a context populated with the current state of the
>> superblock.
>
> Hence why ->fsopen() takes a super_block parameter.
>
>> User can then reset and start over
>
> No, not really.  You cannot reset all options - the source for example,
> probably has to remain the same.  IP addresses on NFS mounts possibly should
> remain the same - though I can see situations where it might be convenient to
> change these.

Well, the current remount API is like that: you give a new set of
options (i.e. reset and replace anything that can be changed and leave
the rest).  Obviously "reset options" wouldn't allow you to change
options that cannot be changed.

>
>> or individually add/remove options.
>
> This is very per-filesystem-type dependent.

So say we have commands like

"o+ foo"
"o- bar"

The generic option parser would just add or remove the option in the
current set of options, and commit would just call ->remount_fs() with
the new set of options.  It would probably not work for the NFS case,
but that's okay, NFS can implement its own option parsing.
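
A rough userspace sketch of what such a generic parser might do with
"o+"/"o-" commands.  Everything here (struct opt_set, opt_apply(),
opt_render(), the size limits) is hypothetical and only models the idea
of keeping a set of options and rendering it as the comma-delimited
string that ->remount_fs() takes today:

```c
#include <stdio.h>
#include <string.h>

#define MAX_OPTS 16	/* arbitrary limits for the sketch */
#define OPT_LEN  32

struct opt_set {
	char names[MAX_OPTS][OPT_LEN];
	int count;
};

static int opt_find(const struct opt_set *s, const char *name)
{
	for (int i = 0; i < s->count; i++)
		if (strcmp(s->names[i], name) == 0)
			return i;
	return -1;
}

/* Apply one "o+ name" or "o- name" command; 0 on success, -1 on error. */
int opt_apply(struct opt_set *s, const char *cmd)
{
	const char *name;
	size_t len;
	int idx;

	if (strncmp(cmd, "o+ ", 3) != 0 && strncmp(cmd, "o- ", 3) != 0)
		return -1;
	name = cmd + 3;
	len = strlen(name);
	if (len == 0 || len >= OPT_LEN || strchr(name, ','))
		return -1;

	idx = opt_find(s, name);
	if (cmd[1] == '+') {
		if (idx >= 0)
			return 0;	/* already present */
		if (s->count == MAX_OPTS)
			return -1;
		strcpy(s->names[s->count++], name);
		return 0;
	}
	if (idx < 0)
		return 0;		/* removing an absent option is a no-op */
	if (idx < s->count - 1)		/* close the gap, keeping the order */
		memmove(s->names[idx], s->names[idx + 1],
			(size_t)(s->count - idx - 1) * OPT_LEN);
	s->count--;
	return 0;
}

/* Render "opt1,opt2,..." as it could be handed over on commit. */
int opt_render(const struct opt_set *s, char *buf, size_t size)
{
	size_t used = 0;

	if (size == 0)
		return -1;
	buf[0] = '\0';
	for (int i = 0; i < s->count; i++) {
		int n = snprintf(buf + used, size - used, "%s%s",
				 i ? "," : "", s->names[i]);

		if (n < 0 || (size_t)n >= size - used)
			return -1;	/* would not fit */
		used += (size_t)n;
	}
	return 0;
}
```

Starting from an empty set, applying "o+ foo", "o+ bar", "o+ baz" and
then "o- bar" renders as "foo,baz".  Options with values, and options
that a filesystem refuses to change, are deliberately left out.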

>> This should be a good place to allow querying the options as well, as Karel
>> suggested.
>
> I'm not sure it's worth the code unless we allow opening extant mounts and
> querying using this mechanism.

I'm saying we should allow opening an existent superblock and allow
query and change of options.

>> Then when the configuration is finished the changes are committed to the
>> superblock.
>
> You're going a lot beyond remount here.  Remount can, in one go, change some
> options which are superblock-only, some options which are mountpoint-only and
> at least one which crosses both domains.

We'll have s_op->remount_fs() for some time yet, but that's a clean,
super-block only operation.  MS_REMOUNT is a flag from hell, leave
that for mount(2) compatibility and forget it for the new API.

>
>> > + (*) bool mounted
>> > +
>> > +     This is set to true once a mount attempt is made.  This causes an error to
>> > +     be given on subsequent mount attempts with the same context and prevents
>> > +     multiple mount attempts.
>>
>> No point.  A context is mountable if the superblock is non-NULL.
>> Don't even need to have the context committed,
>
> Ummm...  Doesn't that render "commit" unnecessary?

No.   "commit" is a superblock operation.  fsmount() is a mount
operation.    fsmount() should not do anything to the superblock and
"commit" should not do anything to any mount.

>> if not, it would simply mount the sb in the previous state.
>
> You want to be able to open a filesystem fd, create or reference a superblock
> and then mount it several times?

Maybe.  I'm looking at this from the point of view of what objects we
have and what operations we want to do on them.  In that view it makes
no sense that fsmount() changes the state of the fsfd, since it's an
operation done on the mount tree and not on the superblock
configuration context.  The fact that you can do any number of mounts
on the fsfd just falls out from this premise.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 3/9] VFS: Introduce a mount context
  2017-05-08 22:57   ` David Howells
  2017-05-09  8:03     ` Miklos Szeredi
@ 2017-05-09  9:32     ` David Howells
  2017-05-09 11:04       ` Miklos Szeredi
  2017-05-09  9:41     ` David Howells
  2017-05-09  9:56     ` David Howells
  3 siblings, 1 reply; 66+ messages in thread
From: David Howells @ 2017-05-09  9:32 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: dhowells, viro, linux-fsdevel, linux-nfs, lkml

Miklos Szeredi <mszeredi@redhat.com> wrote:

> Forget remount, it's a historical remnant.

I don't think it can be set aside so lightly.  Within the kernel, the option
parsing should share as much code as possible between new superblock config,
old new mount and old remount.

The 'trickiest' function we need to support is MS_RDONLY flipping.  That one
affects both the mount and the superblock.  I think all the rest only affect
one side or the other.

Given that a superblock can be mounted in multiple places, do we need to count
the number of read-only mounts that are holding a particular superblock and
only flip the superblock when they're all read-only?

Or do you advocate replacing "mount -o remount,[ro|rw]" with a pair of
operations - one to flip the mount and the other to flip the superblock?

Further, "emergency remount r/o" needs to be supported - though it might make
sense to add a special op just for that.

David

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 3/9] VFS: Introduce a mount context
  2017-05-08 22:57   ` David Howells
  2017-05-09  8:03     ` Miklos Szeredi
  2017-05-09  9:32     ` David Howells
@ 2017-05-09  9:41     ` David Howells
  2017-05-09 12:02       ` Miklos Szeredi
  2017-05-09  9:56     ` David Howells
  3 siblings, 1 reply; 66+ messages in thread
From: David Howells @ 2017-05-09  9:41 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: dhowells, viro, linux-fsdevel, linux-nfs, lkml

Miklos Szeredi <mszeredi@redhat.com> wrote:

> I think that's crazy.  We don't return detailed errors for any other
> syscall for path lookup, so why would path lookup for mount be
> special.

Firstly, we don't return detailed errors for mount() at the moment either.

Secondly, path lookup might entail automounts, so perhaps we should do it for
path lookup too.  Particularly in light of the fact that NFS4 mount uses
pathwalk to get from server:/ to server:/the/dir/I/actually/wanted/ so I'm
currently losing that error:-/

Thirdly, the security operation I'm talking about is separate to path lookup -
though perhaps we should pass LOOKUP_MOUNT as an intent flag into pathwalk so
that the security check can be done there; perhaps combined with another one.

Fourthly, why shouldn't we consider extending the facility to other system
calls in future?  It would involve copying the string to task_struct and
providing a way to retrieve it, but that's not that hard to achieve.

> And why would
> 
>     fd = open("/foo/bar", O_PATH);
>     fsmount(fsfd, fd, NULL);
> 
> behave differently from
> 
>     fsmount(fsfd, -1, "/foo/bar");
> 
> ?

There's an argument that the former should return EFAULT.  And that you should
set the path to "" and pass AT_EMPTY_PATH.  I should probably make sure it
does that - and add a flags field.  statx() was fixed to work this way.

Question for you: Should the MNT_* flags be passed to fsmount(), perhaps in
MS_* form?

David

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 3/9] VFS: Introduce a mount context
  2017-05-08 22:57   ` David Howells
                       ` (2 preceding siblings ...)
  2017-05-09  9:41     ` David Howells
@ 2017-05-09  9:56     ` David Howells
  2017-05-09 12:38       ` Miklos Szeredi
  3 siblings, 1 reply; 66+ messages in thread
From: David Howells @ 2017-05-09  9:56 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: dhowells, viro, linux-fsdevel, linux-nfs, lkml

Miklos Szeredi <mszeredi@redhat.com> wrote:

> So say we have commands like
> 
> "o+ foo"
> "o- bar"

The convention seems to be to prepend "no" to things you want to disable, so
let's stick with that, e.g.:

	"o foo"
	"o nobar"

otherwise we will have to have separate parsers for old mount() and the new sb
config code - and not just for NFS, but at least for ext2/3/4 also.

Further, we can only publish one format in /proc/mounts - and we cannot change
that from the foo/nofoo standard we already use as it's part of the UAPI.

> The generic option parser would just add or remove the option in the
> current set of options,

It sounds like you want to build up a string of "opt1,opt2,opt3" then have the
VFS add and remove things from it and then parse it into the filesystem's
internal structures on "commit".

> and commit would just call ->remount_fs() with the new set of options.

You're defining "commit" to do different things depending on the situation.
You need a separation between "commit create" and "commit update".

> It would probably not work for the NFS case, but that's okay, NFS can
> implement its own option parsing.

If NFS has to implement its own option parsing, we've done it wrong.

David

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 3/9] VFS: Introduce a mount context
  2017-05-09  9:32     ` David Howells
@ 2017-05-09 11:04       ` Miklos Szeredi
  0 siblings, 0 replies; 66+ messages in thread
From: Miklos Szeredi @ 2017-05-09 11:04 UTC (permalink / raw)
  To: David Howells; +Cc: viro, linux-fsdevel, linux-nfs, lkml

On Tue, May 9, 2017 at 11:32 AM, David Howells <dhowells@redhat.com> wrote:
> Miklos Szeredi <mszeredi@redhat.com> wrote:
>
>> Forget remount, it's a historical remnant.
>
> I don't think it can be set aside so lightly.  Within the kernel, the option
> parsing should share as much code as possible between new superblock config,
> old new mount and old remount.

Let's make things clear:  VFS didn't do any option parsing for
mount(2), it was all in filesystem's fstype->mount() and
s_op->remount_fs() operations.  What the VFS did do is filter out the
junk from MS_xxx options and pass only the relevant ones to the
filesystem creation functions, which was mount_fs() and
do_remount_sb().   Note how those functions are in super.c and don't
have a vfsmount argument.

So I propose introducing a third way of parsing arguments, which a
filesystem may implement via sb_config_ops (or whatever we want to
call it) that allows it to parse options into its internal structures
and have it be passed to superblock creation and superblock
reconfiguration ops (which also need to be new ones that take the
parsed options in the sb_config structure instead of as a comma
delimited string).  With the fsopen() API the generic code (possibly
via helpers called from fs code) would need to parse the "MS_xxx" type
options now, and the infrastructure for that is new, since previously
those options were parsed in userland instead of in the kernel.

There would be no duplication as filesystems would either implement
the old option parsing or the new one.

Also we could have various helpers that do most of the dirty work of
option parsing, allowing easy migration of filesystems.   In the end
the old method taking the unparsed options can go away.

And, as you say, the option parsing would be shared between old "new
mount", old "remount" and new sb config.  And it would be shared for
the unmigrated fs case as well as the migrated fs case.

And we are still only talking about sb config, not a word about mount
attributes; they should be irrelevant to any of this API shuffling and
new API additions.

> The 'trickiest' function we need to support is MS_RDONLY flipping.  That one
> affects both the mount and the superblock.  I think all the rest only affect
> one side or the other.
>
> Given that a superblock can be mounted in multiple places, do we need to count
> the number of read-only mounts that are holding a particular superblock and
> only flip the superblock when they're all read-only?

Nothing special going on here.  If sb is ro then adding a rw mount
should either fail or automatically go ro.  I think just erroring out
is the better of the two.


> Or do you advocate replacing "mount -o remount,[ro|rw]" with a pair of
> operations - one to flip the mount and the other to flip the superblock?

Yes, definitely.  That's exactly what users have been asking for
(there's even a bugzilla somewhere I don't remember)
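
To illustrate the two points above (a rw mount on a ro sb errors out,
and the mount and the superblock are flipped by separate operations),
here is a toy userspace model.  The struct and function names are
invented for the sketch, and returning -EROFS is an assumption; the
thread does not say which error would be used:

```c
#include <errno.h>
#include <stdbool.h>

struct model_sb {
	bool read_only;
	int nr_mounts;
};

struct model_mnt {
	struct model_sb *sb;
	bool read_only;
};

/* Attach a new mount; a rw mount of a ro superblock is refused. */
int model_attach(struct model_sb *sb, struct model_mnt *mnt, bool mnt_ro)
{
	if (sb->read_only && !mnt_ro)
		return -EROFS;
	mnt->sb = sb;
	mnt->read_only = mnt_ro;
	sb->nr_mounts++;
	return 0;
}

/* Flip only the mount.  Going rw is refused while the sb is ro. */
int model_set_mnt_ro(struct model_mnt *mnt, bool ro)
{
	if (!ro && mnt->sb->read_only)
		return -EROFS;
	mnt->read_only = ro;
	return 0;
}

/* Flip only the superblock; no mount is touched. */
void model_set_sb_ro(struct model_sb *sb, bool ro)
{
	sb->read_only = ro;
}
```

What should happen when the sb is flipped to ro while rw mounts of it
still exist is not modelled here.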

Thanks
Miklos

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 3/9] VFS: Introduce a mount context
  2017-05-09  9:41     ` David Howells
@ 2017-05-09 12:02       ` Miklos Szeredi
  2017-05-09 18:51         ` Jeff Layton
  0 siblings, 1 reply; 66+ messages in thread
From: Miklos Szeredi @ 2017-05-09 12:02 UTC (permalink / raw)
  To: David Howells; +Cc: viro, linux-fsdevel, linux-nfs, lkml

On Tue, May 9, 2017 at 11:41 AM, David Howells <dhowells@redhat.com> wrote:
> Miklos Szeredi <mszeredi@redhat.com> wrote:
>
>> I think that's crazy.  We don't return detailed errors for any other
>> syscall for path lookup, so why would path lookup for mount be
>> special.
>
> Firstly, we don't return detailed errors for mount() at the moment either.
>
> Secondly, path lookup might entail automounts, so perhaps we should do it for
> path lookup too.  Particularly in light of the fact that NFS4 mount uses
> pathwalk to get from server:/ to server:/the/dir/I/actually/wanted/ so I'm
> currently losing that error:-/
>
> Thirdly, the security operation I'm talking about is separate to path lookup -
> though perhaps we should pass LOOKUP_MOUNT as an intent flag into pathwalk so
> that the security check can be done there; perhaps combined with another one.
>
> Fourthly, why shouldn't we consider extending the facility to other system
> calls in future?  It would involve copying the string to task_struct and
> providing a way to retrieve it, but that's not that hard to achieve.

Maybe we should.   In fact that sounds like a splendid idea.  IMO even
better than having errors go via the fsfd descriptor.  Pretty cheap
on the kernel side, and completely optional on the userspace side.

>
>> And why would
>>
>>     fd = open("/foo/bar", O_PATH);
>>     fsmount(fsfd, fd, NULL);
>>
>> behave differently from
>>
>>     fsmount(fsfd, -1, "/foo/bar");
>>
>> ?
>
> There's an argument that the former should return EFAULT.  And that you should
> set the path to "" and pass AT_EMPTY_PATH.  I should probably make sure it
> does that - and add a flags field.  statx() was fixed to work this way.
>
> Question for you: Should the MNT_* flags be passed to fsmount(), perhaps in
> MS_* form?

MS_* flags are a mess.  I don't think they should be used for any new
functionality.  MNT_* flags are much better, but there are some
internal flags there as well.

I think the struct file model is better, where we have the external
O_* flags and the internal FMODE_* flags.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 3/9] VFS: Introduce a mount context
  2017-05-09  9:56     ` David Howells
@ 2017-05-09 12:38       ` Miklos Szeredi
  0 siblings, 0 replies; 66+ messages in thread
From: Miklos Szeredi @ 2017-05-09 12:38 UTC (permalink / raw)
  To: David Howells; +Cc: viro, linux-fsdevel, linux-nfs, lkml

On Tue, May 9, 2017 at 11:56 AM, David Howells <dhowells@redhat.com> wrote:
> Miklos Szeredi <mszeredi@redhat.com> wrote:
>
>> So say we have commands like
>>
>> "o+ foo"
>> "o- bar"
>
> The convention seems to be to prepend "no" to things you want to disable, so
> let's stick with that, e.g.:
>
>         "o foo"
>         "o nobar"
>
> otherwise we will have to have separate parsers for old mount() and the new sb
> config code - and not just for NFS, but at least for ext2/3/4 also.
>
> Further, we can only publish one format in /proc/mounts - and we cannot change
> that from the foo/nofoo standard we already use as it's part of the UAPI.

You're right that this is a complicated issue and worth more
discussion.  And also you are right that we cannot change existing
UAPI, which is going to cause some headaches.

But that doesn't mean the new UAPI must follow the conventions of the
badly defined existing UAPI.

And the "no*" convention is anything but well defined, so we cannot
just stick it into generic code, because you'll find exceptions
everywhere.

And one more reason to have a new, unambiguous UAPI for retrieving
superblock options.

>
>> The generic option parser would just add or remove the option in the
>> current set of options,
>
> It sounds like you want to build up a string of "opt1,opt2,opt3" then have the
> VFS add and remove things from it and then parse it into the filesystem's
> internal structures on "commit".

That would be the default operation, if the filesystem doesn't define
its own parser.

>> and commit would just call ->remount_fs() with the new set of options.
>
> You're defining "commit" to do different things depending on the situation.
> You need a separation between "commit create" and "commit update".

It would be different, yes, at least until the superblock creation api
is completely transformed, at which point it may actually become the
same thing.  But let's not jump ahead.

>> It would probably not work for the NFS case, but that's okay, NFS can
>> implement its own option parsing.
>
> If NFS has to implement its own option parsing, we've done it wrong.

My above sentence was not clear.  What I meant to say is that NFS needs
to implement the non-generic option parsing function in order to be
able to handle the case of "you can't change the server IP address".
Which it would want to do anyway, since it will result in cleaner
code.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 4/9] Implement fsopen() to prepare for a mount
  2017-05-04 13:34     ` Karel Zak
@ 2017-05-09 18:40       ` Jeff Layton
  0 siblings, 0 replies; 66+ messages in thread
From: Jeff Layton @ 2017-05-09 18:40 UTC (permalink / raw)
  To: Karel Zak, David Howells
  Cc: viro, linux-fsdevel, linux-nfs, linux-kernel, mszeredi

On Thu, 2017-05-04 at 15:34 +0200, Karel Zak wrote:
> On Thu, May 04, 2017 at 02:06:51PM +0100, David Howells wrote:
> > Karel Zak <kzak@redhat.com> wrote:
> > > The very basic mount(2) problem is that you have to parse
> > > /proc/self/mountinfo to get information about the mounted filesystem.
> > > It seems that your read() is also one way communication.
> > > 
> > > What we really need is a way to specify *what* you want to
> > > read. The error message is not enough, I want to know the finally used
> > > mount options, mount ID, etc. It would be nice to have something like
> > > 
> > > 
> > >    fsmount(mfd, AT_FDCWD, "/mnt", 0);
> > > 
> > >    write(mfd, "o");
> > >    read(mfd, ....);     // read mount options
> > > 
> > >    write(mfd, "i");
> > >    read(mfd, ....);     // read mount ID
> > > 
> > > 
> > > but it seems ugly. Maybe introduce another function like 
> > > 
> > >     fsinfo(mfd, "o", buf, bufsz)
> > > 

Not sure that it's any prettier, but you could use an ioctl to switch
between different read requests:

ioctl(mfd, FSOPEN_READ_OPTIONS);
read(mfd, ....);

ioctl(mfd, FSOPEN_READ_MOUNT_ID);
read(mfd, ....);

> > > to get mount options (etc.) and to avoid separate write & read.
> > 
> > What is it you're trying to do?  Just read back the state of the new mount?
> 
>  ...read back the state of the new mount, because for example mount
>  options can be modified by FS driver. It would be also nice to have
>  API to get state of arbitrary mount without parsing mountinfo (the
>  file is huge on some systems).
> 


-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 3/9] VFS: Introduce a mount context
  2017-05-09 12:02       ` Miklos Szeredi
@ 2017-05-09 18:51         ` Jeff Layton
  2017-05-10  7:24           ` Miklos Szeredi
  2017-05-10  8:05           ` David Howells
  0 siblings, 2 replies; 66+ messages in thread
From: Jeff Layton @ 2017-05-09 18:51 UTC (permalink / raw)
  To: Miklos Szeredi, David Howells; +Cc: viro, linux-fsdevel, linux-nfs, lkml

On Tue, 2017-05-09 at 14:02 +0200, Miklos Szeredi wrote:
> On Tue, May 9, 2017 at 11:41 AM, David Howells <dhowells@redhat.com> wrote:
> > Miklos Szeredi <mszeredi@redhat.com> wrote:
> > 
> > > I think that's crazy.  We don't return detailed errors for any other
> > > syscall for path lookup, so why would path lookup for mount be
> > > special.
> > 
> > Firstly, we don't return detailed errors for mount() at the moment either.
> > 
> > Secondly, path lookup might entail automounts, so perhaps we should do it for
> > path lookup too.  Particularly in light of the fact that NFS4 mount uses
> > pathwalk to get from server:/ to server:/the/dir/I/actually/wanted/ so I'm
> > currently losing that error:-/
> > 
> > Thirdly, the security operation I'm talking about is separate to path lookup -
> > though perhaps we should pass LOOKUP_MOUNT as an intent flag into pathwalk so
> > that the security check can be done there; perhaps combined with another one.
> > 
> > Fourthly, why shouldn't we consider extending the facility to other system
> > calls in future?  It would involve copying the string to task_struct and
> > providing a way to retrieve it, but that's not that hard to achieve.
> 
> Maybe we should.   In fact that sounds like a splendid idea.  IMO even
> better than having errors go via the fsfd descriptor.  Pretty cheap
> on the kernel side, and completely optional on the userspace side.
> 

A question here: What should happen if you go to set an error here, and
one is already set? Should it just free the string and replace it with
the new one? IOW, just keep the latest error? Or is it better to keep
the earlier one?

If you want to put this in the task_struct then I think you'll want to
sort that out. You could easily end up in this situation if a lot of
different kernel subsystems started using it to pass back detailed
errors.

> > 
> > > And why would
> > > 
> > >     fd = open("/foo/bar", O_PATH);
> > >     fsmount(fsfd, fd, NULL);
> > > 
> > > behave differently from
> > > 
> > >     fsmount(fsfd, -1, "/foo/bar");
> > > 
> > > ?
> > 
> > There's an argument that the former should return EFAULT.  And that you should
> > set the path to "" and pass AT_EMPTY_PATH.  I should probably make sure it
> > does that - and add a flags field.  statx() was fixed to work this way.
> > 
> > Question for you: Should the MNT_* flags be passed to fsmount(), perhaps in
> > MS_* form?
> 
> MS_* flags are a mess.  I don't think they should be used for any new
> functionality.  MNT_* flags are much better, but there are some
> internal flags there as well.
> 
> I think the struct file model is better, where we have the external
> O_* flags and the internal FMODE_* flags.
> 
> Thanks,
> Miklos

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 3/9] VFS: Introduce a mount context
  2017-05-09 18:51         ` Jeff Layton
@ 2017-05-10  7:24           ` Miklos Szeredi
  2017-05-10  8:05           ` David Howells
  1 sibling, 0 replies; 66+ messages in thread
From: Miklos Szeredi @ 2017-05-10  7:24 UTC (permalink / raw)
  To: Jeff Layton; +Cc: David Howells, viro, linux-fsdevel, linux-nfs, lkml

On Tue, May 9, 2017 at 8:51 PM, Jeff Layton <jlayton@redhat.com> wrote:
> On Tue, 2017-05-09 at 14:02 +0200, Miklos Szeredi wrote:
>> On Tue, May 9, 2017 at 11:41 AM, David Howells <dhowells@redhat.com> wrote:
>> > Miklos Szeredi <mszeredi@redhat.com> wrote:
>> >
>> > > I think that's crazy.  We don't return detailed errors for any other
>> > > syscall for path lookup, so why would path lookup for mount be
>> > > special.
>> >
>> > Firstly, we don't return detailed errors for mount() at the moment either.
>> >
>> > Secondly, path lookup might entail automounts, so perhaps we should do it for
>> > path lookup too.  Particularly in light of the fact that NFS4 mount uses
>> > pathwalk to get from server:/ to server:/the/dir/I/actually/wanted/ so I'm
>> > currently losing that error:-/
>> >
>> > Thirdly, the security operation I'm talking about is separate to path lookup -
>> > though perhaps we should pass LOOKUP_MOUNT as an intent flag into pathwalk so
>> > that the security check can be done there; perhaps combined with another one.
>> >
>> > Fourthly, why shouldn't we consider extending the facility to other system
>> > calls in future?  It would involve copying the string to task_struct and
>> > providing a way to retrieve it, but that's not that hard to achieve.
>>
>> Maybe we should.   In fact that sounds like a splendid idea.  IMO even
>> better than having errors go via the fsfd descriptor.  Pretty cheap
>> on the kernel side, and completely optional on the userspace side.
>>
>
> A question here: What should happen if you go to set an error here, and
> one is already set? Should it just free the string and replace it with
> the new one? IOW, just keep the latest error? Or is it better to keep
> the earlier one?
>
> If you want to put this in the task_struct then I think you'll want to
> sort that out. You could easily end up in this situation if a lot of
> different kernel subsystems started using it to pass back detailed
> errors.

Possible rule of thumb: use it only at the place where the error
originates and not where errors are just passed on.  This would result
in at most one report per syscall, normally.

And the static string thing that David implemented is also a very good
idea, IMO.

So it would look something like this (possibly needs better naming):

   error_detail("description of error");

or

   return error_detail(-EINVAL, "description of error");

Compiler could automatically include source file/line information as
well, although it may be enough if the string is uniquely greppable
(we could check uniqueness at compile time).
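
For what it's worth, a userspace model of the second form could look
roughly like the following.  The thread-local slot stands in for the
task_struct field; error_detail_set(), error_detail_take() and the two
example callers are hypothetical names.  Keeping the first message when
one is already set is just one possible answer to Jeff's question, not
a decided rule:

```c
#include <errno.h>
#include <stddef.h>

#define ERR_STR2(x) #x
#define ERR_STR(x)  ERR_STR2(x)

/* Stand-in for a per-task pointer to a static string. */
static _Thread_local const char *current_error_detail;

/* Record msg unless a message is already pending, then pass err through. */
int error_detail_set(int err, const char *msg)
{
	if (!current_error_detail)
		current_error_detail = msg;
	return err;
}

/* Retrieve and clear the pending message (NULL if there is none). */
const char *error_detail_take(void)
{
	const char *msg = current_error_detail;

	current_error_detail = NULL;
	return msg;
}

/* The literal gets file:line prepended at compile time; nothing is copied. */
#define error_detail(err, msg) \
	error_detail_set((err), __FILE__ ":" ERR_STR(__LINE__) ": " msg)

/* The error originates here, so this is the place that reports it. */
int parse_source(const char *source)
{
	if (!source || !*source)
		return error_detail(-EINVAL, "source not specified");
	return 0;
}

/* The error is only passed on here, so no second report is made. */
int do_mount_model(const char *source)
{
	int ret = parse_source(source);

	if (ret < 0)
		return ret;
	return 0;
}
```

After do_mount_model("") fails with -EINVAL, error_detail_take()
returns the file:line-prefixed message once and NULL after that.  Since
only string literals are stored, the module lifetime issue with static
strings applies to this model as well.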

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 3/9] VFS: Introduce a mount context
  2017-05-09 18:51         ` Jeff Layton
  2017-05-10  7:24           ` Miklos Szeredi
@ 2017-05-10  8:05           ` David Howells
  2017-05-10 13:20             ` Jeff Layton
  2017-05-10 13:31             ` David Howells
  1 sibling, 2 replies; 66+ messages in thread
From: David Howells @ 2017-05-10  8:05 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: dhowells, Jeff Layton, viro, linux-fsdevel, linux-nfs, lkml

Miklos Szeredi <mszeredi@redhat.com> wrote:

> And the static string thing that David implemented is also a very good
> idea, IMO.

There is an issue with it: it's fine as long as you keep a ref on the module
that generated it or clear all strings as part of module removal (which the
mount context in this patchset does).  With the NFS mount context I did, I
have to keep a ref on the NFS protocol module as well as the NFS filesystem
module.

I'm tempted to make it conditionally copy the string using kvasprintf_const()
- which would also permit format substitution.

David

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 3/9] VFS: Introduce a mount context
  2017-05-09  8:03     ` Miklos Szeredi
@ 2017-05-10 12:41       ` Karel Zak
  0 siblings, 0 replies; 66+ messages in thread
From: Karel Zak @ 2017-05-10 12:41 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: David Howells, viro, linux-fsdevel, linux-nfs, lkml

On Tue, May 09, 2017 at 10:03:43AM +0200, Miklos Szeredi wrote:
> On Tue, May 9, 2017 at 12:57 AM, David Howells <dhowells@redhat.com> wrote:
> > Miklos Szeredi <mszeredi@redhat.com> wrote:
> >
> >> > + (3) Validate and pre-process the mount context.
> >>
> >> (3.5) Create super block
> >>
> >> I think this needs to be triggered by something like a "commit" command
> >> from userspace.  Basically this is where the options are atomically
> >> set on the new (create) or existing (reconfigure) superblock.
> >
> > Why do you need to expose this step to userspace?  Assuming in the "new" case
> > you do, say:
> >
> >         fd = fsopen("nfs");
> >         write(fd, "s foo.bar:/bar", ...);
> >         write(fd, "o intr", ...);
> >         write(fd, "o fsc", ...);
> >         ...
> >         write(fd, "c", ...); /* commit operation to get a superblock */
> >         fsmount(fd, AT_FDCWD, "/mnt");  /* mount the superblock we just got */
> >
> > Then the "commit" op is dissimilar to "mount -o remount" since remount may
> > alter the superblock parameters *and* the mountpoint parameters, but commit
> > can only affect the superblock.
> 
> Forget remount, it's a historical remnant.  We need fsreconfig(sb) and
> setmntattr(mnt).  They are changing properties of different objects.

I agree, and I'd like to highlight another issue we have with the
current mount(2): mounting a filesystem with several propagation
flags is not atomic. For example:

  mount /dev/sda1 /A -o private,unbindable,ro

This is supported by mount(8), but it is implemented as three
independent mount(2) calls:

    - 1st mounts /dev/sda1 with MS_RDONLY
    - 2nd sets MS_PRIVATE flag
    - 3rd sets MS_UNBINDABLE flag.

It would be nice to set all the VFS flags first and then attach the
context to the tree as a single atomic operation.
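
As a rough illustration of the split described above, here is a sketch of
how one combined flag set turns into separate mount(2) calls. It only
plans the calls and mounts nothing. struct mount_step and
plan_mount_steps() are hypothetical names made up for this example, not
util-linux code, and the ordering simply follows the list above.

```c
#include <stddef.h>
#include <sys/mount.h>

/* One planned mount(2) call; a hypothetical type used only for illustration. */
struct mount_step {
	const char *source;	/* NULL for the flag-only calls */
	unsigned long flags;	/* flags passed to that one call */
};

/*
 * Split a combined flag set into the sequence of mount(2) calls described
 * above: one call that mounts the filesystem with the non-propagation
 * flags, then one call per propagation flag.  Nothing is mounted here;
 * the function only fills out[] and returns the number of steps, or -1
 * if out[] is too small.
 */
int plan_mount_steps(const char *source, unsigned long flags,
		     struct mount_step *out, size_t max)
{
	static const unsigned long propagation[] = {
		MS_SHARED, MS_SLAVE, MS_PRIVATE, MS_UNBINDABLE,
	};
	const size_t nprop = sizeof(propagation) / sizeof(propagation[0]);
	unsigned long prop_mask = 0;
	size_t i, n = 0;

	for (i = 0; i < nprop; i++)
		prop_mask |= propagation[i];

	/* Step 1: the real mount, carrying only the non-propagation flags. */
	if (max < 1)
		return -1;
	out[n].source = source;
	out[n].flags = flags & ~prop_mask;
	n++;

	/* Steps 2..N: one flag-only call per requested propagation flag. */
	for (i = 0; i < nprop; i++) {
		if (!(flags & propagation[i]))
			continue;
		if (n == max)
			return -1;
		out[n].source = NULL;
		out[n].flags = propagation[i];
		n++;
	}
	return (int)n;
}
```

For MS_RDONLY | MS_PRIVATE | MS_UNBINDABLE this plans three steps, and the
mount exists in the tree after the first one, before the remaining flags
have been applied. That gap is what an atomic attach would close.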

    Karel

-- 
 Karel Zak  <kzak@redhat.com>
 http://karelzak.blogspot.com


* Re: [PATCH 3/9] VFS: Introduce a mount context
  2017-05-10  8:05           ` David Howells
@ 2017-05-10 13:20             ` Jeff Layton
  2017-05-10 13:30               ` Miklos Szeredi
  2017-05-10 13:31             ` David Howells
  1 sibling, 1 reply; 66+ messages in thread
From: Jeff Layton @ 2017-05-10 13:20 UTC (permalink / raw)
  To: David Howells, Miklos Szeredi; +Cc: viro, linux-fsdevel, linux-nfs, lkml

On Wed, 2017-05-10 at 09:05 +0100, David Howells wrote:
> Miklos Szeredi <mszeredi@redhat.com> wrote:
>
> > Possible rule of thumb: use it only at the place where the error
> > originates and not where errors are just passed on.  This would result
> > in at most one report per syscall, normally.
> >

That might be hard to enforce in practice once you get into some
complicated layering. What if we have device_mapper setting this along
with filesystems too? We need clear rules here.

> > And the static string thing that David implemented is also a very good
> > idea, IMO.
> 
> There is an issue with it: it's fine as long as you keep a ref on the module
> that generated it or clear all strings as part of module removal (which the
> mount context in this patchset does).  With the NFS mount context I did, I
> have to keep a ref on the NFS protocol module as well as the NFS filesystem
> module.
> 
> I'm tempted to make it conditionally copy the string using kvasprintf_const()
> - which would also permit format substitution.
> 

On balance, I think this is a reasonable way to pass back detailed
errors. Up until now, we've mostly relied on just printk'ing them. Now
though, a lot of larger machines are running containerized setups. Good
luck scraping dmesg for _your_ error in that situation. There may be
tons of mounts failing all over the place.

That said, I have some concerns here:

What's the lifetime of these strings? Do they just hang around forever
until the process goes away or they're replaced? If this becomes common,
then you could easily end up with an extra string allocation per task in
some cases. That could add up.

One idea might be to always kfree it on syscall entry, and that might
mitigate the problem assuming that not everything is erroring out. Then
you could always do some trivial syscall to clear it manually.

There's also the problem of how these should be formatted. Is English ok
everywhere? Do we need a facility to allow translating these things?
-- 
Jeff Layton <jlayton@redhat.com>


* Re: [PATCH 3/9] VFS: Introduce a mount context
  2017-05-10 13:20             ` Jeff Layton
@ 2017-05-10 13:30               ` Miklos Szeredi
  2017-05-10 13:33                 ` Miklos Szeredi
  2017-05-10 13:48                 ` Jeff Layton
  0 siblings, 2 replies; 66+ messages in thread
From: Miklos Szeredi @ 2017-05-10 13:30 UTC (permalink / raw)
  To: Jeff Layton; +Cc: David Howells, viro, linux-fsdevel, linux-nfs, lkml

On Wed, May 10, 2017 at 3:20 PM, Jeff Layton <jlayton@redhat.com> wrote:
> On Wed, 2017-05-10 at 09:05 +0100, David Howells wrote:
>> Miklos Szeredi <mszeredi@redhat.com> wrote:
>>
>> > Possible rule of thumb: use it only at the place where the error
>> > originates and not where errors are just passed on.  This would result
>> > in at most one report per syscall, normally.
>> >
>
> That might be hard to enforce in practice once you get into some
> complicated layering. What if we have device_mapper setting this along
> with filesystems too? We need clear rules here.

If the error originates in the devicemapper, then why would the
filesystem set it?

There's always a root cause of an error and that should be where the
detailed error is set.

Am I missing something?

>
>> > And the static string thing that David implemented is also a very good
>> > idea, IMO.
>>
>> There is an issue with it: it's fine as long as you keep a ref on the module
>> that generated it or clear all strings as part of module removal (which the
>> mount context in this patchset does).  With the NFS mount context I did, I
>> have to keep a ref on the NFS protocol module as well as the NFS filesystem
>> module.
>>
>> I'm tempted to make it conditionally copy the string using kvasprintf_const()
>> - which would also permit format substitution.
>>
>
> On balance, I think this is a reasonable way to pass back detailed
> errors. Up until now, we've mostly relied on just printk'ing them. Now
> though, a lot of larger machines are running containerized setups. Good
> luck scraping dmesg for _your_ error in that situation. There may be
> tons of mounts failing all over the place.
>
> That said, I have some concerns here:
>
> What's the lifetime of these strings? Do they just hang around forever
> until the process goes away or they're replaced? If this becomes common,
> then you could easily end up with an extra string allocation per task in
> some cases. That could add up.

That's why I liked the static string thing.  It's just one assignment
and no worries about freeing.  Not sure what to do about modules,
though.  Can we somehow move the cost of checking the validity to the
place where the error is retrieved?

>
> One idea might be to always kfree it on syscall entry, and that might
> mitigate the problem assuming that not everything is erroring out. Then
> you could always do some trivial syscall to clear it manually.
>
> There's also the problem of how these should be formatted. Is English ok
> everywhere? Do we need a facility to allow translating these things?

Messages in dmesg are in English too.  If necessary userspace will do
the translation.  I don't think the kernel would need to worry about
that.

Thanks,
Miklos


> --
> Jeff Layton <jlayton@redhat.com>


* Re: [PATCH 3/9] VFS: Introduce a mount context
  2017-05-10  8:05           ` David Howells
  2017-05-10 13:20             ` Jeff Layton
@ 2017-05-10 13:31             ` David Howells
  2017-05-10 13:37               ` Jeff Layton
  1 sibling, 1 reply; 66+ messages in thread
From: David Howells @ 2017-05-10 13:31 UTC (permalink / raw)
  To: Jeff Layton
  Cc: dhowells, Miklos Szeredi, viro, linux-fsdevel, linux-nfs, lkml

Jeff Layton <jlayton@redhat.com> wrote:

> One idea might be to always kfree it on syscall entry

You can't do that otherwise there's no way to retrieve the strings.

David


* Re: [PATCH 3/9] VFS: Introduce a mount context
  2017-05-10 13:30               ` Miklos Szeredi
@ 2017-05-10 13:33                 ` Miklos Szeredi
  2017-05-10 13:48                 ` Jeff Layton
  1 sibling, 0 replies; 66+ messages in thread
From: Miklos Szeredi @ 2017-05-10 13:33 UTC (permalink / raw)
  To: Jeff Layton; +Cc: David Howells, viro, linux-fsdevel, linux-nfs, lkml

> That's why I liked the static string thing.  It's just one assignment
> and no worries about freeing.  Not sure what to do about modules,
> though.  Can we somehow move the cost of checking the validity to the
> place where the error is retrieved?

I'm thinking along the lines of not allowing module virtual addresses
to be recycled after module remove...

Thanks,
Miklos


* Re: [PATCH 3/9] VFS: Introduce a mount context
  2017-05-10 13:31             ` David Howells
@ 2017-05-10 13:37               ` Jeff Layton
  0 siblings, 0 replies; 66+ messages in thread
From: Jeff Layton @ 2017-05-10 13:37 UTC (permalink / raw)
  To: David Howells; +Cc: Miklos Szeredi, viro, linux-fsdevel, linux-nfs, lkml

On Wed, 2017-05-10 at 14:31 +0100, David Howells wrote:
> Jeff Layton <jlayton@redhat.com> wrote:
> 
> > One idea might be to always kfree it on syscall entry
> 
> You can't do that otherwise there's no way to retrieve the strings.
> 
> 

True...you'd have to exempt the syscall that does the retrieving.
-- 
Jeff Layton <jlayton@redhat.com>


* Re: [PATCH 3/9] VFS: Introduce a mount context
  2017-05-10 13:30               ` Miklos Szeredi
  2017-05-10 13:33                 ` Miklos Szeredi
@ 2017-05-10 13:48                 ` Jeff Layton
  2017-05-12  8:15                   ` Miklos Szeredi
  1 sibling, 1 reply; 66+ messages in thread
From: Jeff Layton @ 2017-05-10 13:48 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: David Howells, viro, linux-fsdevel, linux-nfs, lkml

On Wed, 2017-05-10 at 15:30 +0200, Miklos Szeredi wrote:
> On Wed, May 10, 2017 at 3:20 PM, Jeff Layton <jlayton@redhat.com> wrote:
> > On Wed, 2017-05-10 at 09:05 +0100, David Howells wrote:
> > > Miklos Szeredi <mszeredi@redhat.com> wrote:
> > > 
> > > > Possible rule of thumb: use it only at the place where the error
> > > > originates and not where errors are just passed on.  This would result
> > > > in at most one report per syscall, normally.
> > > > 
> > 
> > That might be hard to enforce in practice once you get into some
> > complicated layering. What if we have device_mapper setting this along
> > with filesystems too? We need clear rules here.
> 
> If the error originates in the devicemapper, then why would the
> filesystem set it?
> 
> There's always a root cause of an error and that should be where the
> detailed error is set.
> 
> Am I missing something?
> 

I was thinking that you'd need some well-defined way to tell whether the
string should be replaced. If the thing just hangs out across syscalls,
then you don't know when it got put there. Is it a leftover from a
previous syscall or did a lower layer just put it there?

But...maybe I'm making assumptions about how this would work and I
should just wait until there are patches in flight. Getting the lifetime
of these strings right will be crucial though.

> > 
> > > > And the static string thing that David implemented is also a very good
> > > > idea, IMO.
> > > 
> > > There is an issue with it: it's fine as long as you keep a ref on the module
> > > that generated it or clear all strings as part of module removal (which the
> > > mount context in this patchset does).  With the NFS mount context I did, I
> > > have to keep a ref on the NFS protocol module as well as the NFS filesystem
> > > module.
> > > 
> > > I'm tempted to make it conditionally copy the string using kvasprintf_const()
> > > - which would also permit format substitution.
> > > 
> > 
> > On balance, I think this is a reasonable way to pass back detailed
> > errors. Up until now, we've mostly relied on just printk'ing them. Now
> > though, a lot of larger machines are running containerized setups. Good
> > luck scraping dmesg for _your_ error in that situation. There may be
> > tons of mounts failing all over the place.
> > 
> > That said, I have some concerns here:
> > 
> > What's the lifetime of these strings? Do they just hang around forever
> > until the process goes away or they're replaced? If this becomes common,
> > then you could easily end up with an extra string allocation per task in
> > some cases. That could add up.
> 
> That's why I liked the static string thing.  It's just one assignment
> and no worries about freeing.  Not sure what to do about modules,
> though.  Can we somehow move the cost of checking the validity to the
> place where the error is retrieved?
> 

Seems a little dangerous, and could be limiting. Dynamically allocated
strings seem like they could be more useful.

> > 
> > One idea might be to always kfree it on syscall entry, and that might
> > mitigate the problem assuming that not everything is erroring out. Then
> > you could always do some trivial syscall to clear it manually.
> > 
> > There's also the problem of how these should be formatted. Is English ok
> > everywhere? Do we need a facility to allow translating these things?
> 
> Messages in dmesg are in English too.  If necessary userspace will do
> the translation.  I don't think the kernel would need to worry about
> that.

Fair enough. It _is_ still an improvement over dmesg, IMO.
-- 
Jeff Layton <jlayton@redhat.com>


* Re: [PATCH 3/9] VFS: Introduce a mount context
  2017-05-10 13:48                 ` Jeff Layton
@ 2017-05-12  8:15                   ` Miklos Szeredi
  0 siblings, 0 replies; 66+ messages in thread
From: Miklos Szeredi @ 2017-05-12  8:15 UTC (permalink / raw)
  To: Jeff Layton; +Cc: David Howells, viro, linux-fsdevel, linux-nfs, lkml

On Wed, May 10, 2017 at 3:48 PM, Jeff Layton <jlayton@redhat.com> wrote:

> I was thinking that you'd need some well-defined way to tell whether the
> string should be replaced. If the thing just hangs out across syscalls,
> then you don't know when it got put there. Is it a leftover from a
> previous syscall or did a lower layer just put it there?

Example userspace code:

    /* Throw away previous error string */
    get_detailed_error(NULL, 0);
    ret = somesyscall(...);
    if (ret == -1) {
        char errbuf[1024];

        /* Get detailed error string for somesyscall */
        get_detailed_error(errbuf, sizeof(errbuf));
        err(1, "%s", errbuf);  /* never pass the message as the format string */
    }

>> That's why I liked the static string thing.  It's just one assignment
>> and no worries about freeing.  Not sure what to do about modules,
>> though.  Can we somehow move the cost of checking the validity to the
>> place where the error is retrieved?
>>
>
> Seems a little dangerous,

True.

>  and could be limiting. Dynamically allocated
> strings seem like they could be more useful.

Overdesign always starts with that.  A static string is infinitely
more descriptive than an error num, and we've done pretty well with
the latter, so I'm not convinced that we really need a formatted
string.

Maybe just use kstrdup_const() if CONFIG_MODULE_UNLOAD is set,
otherwise plain assignment.  Then free the string when retrieving and
on task exit.
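
To make the lifetime rule concrete, here is a small userspace model of
it, a sketch only and not kernel code. struct task_model and the
model_*() functions are hypothetical names; a malloc'd copy stands in
for kstrdup_const(), and the model always copies, so it corresponds to
the CONFIG_MODULE_UNLOAD=y case.

```c
#include <stdlib.h>
#include <string.h>

/* Userspace stand-in for a per-task slot holding the last detailed error. */
struct task_model {
	char *detailed_error;	/* heap copy, or NULL when nothing is pending */
};

/*
 * Record an error at the place where it originates.  Any older message
 * is dropped, so at most one string is held per task.  Returns 0 on
 * success or -1 if the copy cannot be allocated.
 */
int model_set_error(struct task_model *t, const char *msg)
{
	size_t len = strlen(msg) + 1;
	char *copy = malloc(len);

	if (!copy)
		return -1;
	memcpy(copy, msg, len);
	free(t->detailed_error);
	t->detailed_error = copy;
	return 0;
}

/*
 * Copy the pending message into buf (truncating if needed) and free it,
 * so a second call reports nothing.  Passing buf == NULL and len == 0
 * just throws away a stale message.  Returns the number of bytes stored
 * in buf, not counting the terminating NUL.
 */
size_t model_get_error(struct task_model *t, char *buf, size_t len)
{
	size_t n = 0;

	if (buf && len > 0) {
		if (t->detailed_error) {
			n = strlen(t->detailed_error);
			if (n >= len)
				n = len - 1;
			memcpy(buf, t->detailed_error, n);
		}
		buf[n] = '\0';
	}
	free(t->detailed_error);
	t->detailed_error = NULL;
	return n;
}

/* Task exit: release whatever is still pending. */
void model_task_exit(struct task_model *t)
{
	free(t->detailed_error);
	t->detailed_error = NULL;
}
```

Because retrieval frees the string, a message can never be mistaken for
the result of a later syscall once it has been read, and the per-task
cost is bounded by one string.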

Thanks,
Miklos


