linux-api.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/4] fs: add mount_setattr()
@ 2020-07-14 16:14 Christian Brauner
  2020-07-14 16:14 ` [PATCH] mount_setattr.2: New manual page documenting the mount_setattr() system call Christian Brauner
                   ` (6 more replies)
  0 siblings, 7 replies; 13+ messages in thread
From: Christian Brauner @ 2020-07-14 16:14 UTC (permalink / raw)
  To: David Howells, Al Viro, linux-fsdevel
  Cc: linux-kernel, linux-api, Michael Kerrisk, Christian Brauner

Hey everyone,

This series can be found at:
https://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git/log/?h=mount_setattr
https://gitlab.com/brauner/linux/-/commits/mount_setattr
https://github.com/brauner/linux/tree/mount_setattr

This implements the mount_setattr() syscall which has come up a few
times already back in 2018 and again now in recent discussions and as
promised (see [1] and [2]) here is the first version. Sorry for the
delay but there's never enough time to get things done on time.

While the new mount api allows to change the properties of a superblock
there is currently no way to change the mount properties of a mount or
mount tree using mount file descriptors which the new mount api is based
on. In addition the old mount api has the big restriction that mount
options cannot be applied recursively. This hasn't changed since
changing mount options on a per-mount basis was implemented and has been
a frequent request and a source of bugs. The inability to change the
read-only mount option recursively alone is invaluable to userspace. Not
too long ago there was another significant security issue that was
caused by mount properties not being applied recurisvely (cf. [3]).
Additionally, various userspace projects have switched over to the new
mount api (including the ones I maintain) but we still have terrible
hacks and loops around changing mount properties for existing mount
trees. With mount_setattr() all these codepaths will be demoted to
fallback codepaths, hopefully. Michael Kerrisk recently pointed out the
missing support for this syscall as well.

The new mount_setattr() syscall allows to recursively clear and set
mount options in one shot. Multiple calls to change mount options
requesting the same changes are idempotent:

int mount_setattr(int dfd, const char *path, unsigned flags,
                  struct mount_attr *uattr, size_t usize);

Flags to modify path resolution behavior are specified in the @flags
argument. Currently, AT_EMPTY_PATH, AT_RECURSIVE, AT_SYMLINK_NOFOLLOW,
and AT_NO_AUTOMOUNT are supported. If useful, additional lookup flags to
restrict path resolution as introduced with openat2() might be supported
in the future.

mount_setattr() can be expected to grow over time and is designed with
extensibility in mind. It follows the extensible syscall pattern we have
used with other syscalls such as openat2(), clone3(),
sched_{set,get}attr(), and others.
The set of mount options is passed in the uapi struct mount_attr which
currently has the following layout:

struct mount_attr {
	__u64 attr_set;
	__u64 attr_clr;
	__u32 propagation;
	__u32 atime;
};

The @attr_set and @attr_clr members are used to clear and set mount
options. This way a user can e.g. request that a set of flags is to be
raised such as turning mounts readonly by raising MOUNT_ATTR_RDONLY in
@attr_set while at the same time requesting that another set of flags is
to be lowered such as removing noexec from a mount tree by specifying
MOUNT_ATTR_NOEXEC in @attr_clr.

The @propagation field lets callers specify the propagation type of a
mount tree. Propagation is a single property that has four different
settings and as such is not really a flag argument but an enum.
Specifically, it would be unclear what setting and clearing propagation
settings in combination would amount to. The legacy mount() syscall thus
forbids the combination of multiple propagation settings too. The goal
is to keep the semantics of mount propagation somewhat simple as they
are overly complex as it is.

Finally, struct mount_attr contains an @atime field which can be used to
set the atime behavior of a mount tree. Currently, access times are
already treated and defined like an enum in the new mount api so there's
no reason to treat them equivalent to a flag argument. A new atime enum
is introduced. The reason for not reusing the atime flags useable with
fsmount() and defined in the new mount api is that the
MOUNT_ATTR_RELATIME enum is defined as 0. This means, a user wanting to
transition to relative atime cannot simply specify MOUNT_ATTR_RELATIME
in @atime or @attr_set as this would mean not specifying any atime
settings is equivalent to specifying relative atime. This would cause
confusion for userspace as not specifying atime settings would switch
them to relatime. The new set of enums rectifies this by starting the
definition at 1 and letting 0 mean that atime settings are supposed to
be left unchanged.

Changing mount option has quite a few moving parts and the locking is
quite intricate so it is not unlikely that I got subtleties (very) wrong.

I've also merged the syscall addition into the individual arches into
the main patch itself since Linus once expressed that he prefers this
wher it makes sense. Manpage and selftests included.

[1]: https://lore.kernel.org/lkml/20200518144212.xpfjlajgwzwhlq7r@wittgenstein/
[2]: https://lore.kernel.org/lkml/CAKgNAkioH1z-pVimHziWP=ZtyBgCOwoC7ekWGFwzaZ1FPYg-tA@mail.gmail.com/
[3]: https://github.com/moby/moby/issues/37838

Thanks!
Christian

Christian Brauner (4):
  namespace: take lock_mount_hash() directly when changing flags
  namespace: only take read lock in do_reconfigure_mnt()
  fs: add mount_setattr()
  tests: add mount_setattr() selftests

 arch/alpha/kernel/syscalls/syscall.tbl        |   1 +
 arch/arm/tools/syscall.tbl                    |   1 +
 arch/arm64/include/asm/unistd32.h             |   2 +
 arch/ia64/kernel/syscalls/syscall.tbl         |   1 +
 arch/m68k/kernel/syscalls/syscall.tbl         |   1 +
 arch/microblaze/kernel/syscalls/syscall.tbl   |   1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl     |   1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl     |   1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl     |   1 +
 arch/parisc/kernel/syscalls/syscall.tbl       |   1 +
 arch/powerpc/kernel/syscalls/syscall.tbl      |   1 +
 arch/s390/kernel/syscalls/syscall.tbl         |   1 +
 arch/sh/kernel/syscalls/syscall.tbl           |   1 +
 arch/sparc/kernel/syscalls/syscall.tbl        |   1 +
 arch/x86/entry/syscalls/syscall_32.tbl        |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl        |   1 +
 arch/xtensa/kernel/syscalls/syscall.tbl       |   1 +
 fs/internal.h                                 |   7 +
 fs/namespace.c                                | 300 ++++++-
 include/linux/syscalls.h                      |   3 +
 include/uapi/asm-generic/unistd.h             |   4 +-
 include/uapi/linux/mount.h                    |  31 +
 tools/testing/selftests/Makefile              |   1 +
 .../selftests/mount_setattr/.gitignore        |   1 +
 .../testing/selftests/mount_setattr/Makefile  |   7 +
 tools/testing/selftests/mount_setattr/config  |   1 +
 .../mount_setattr/mount_setattr_test.c        | 802 ++++++++++++++++++
 27 files changed, 1145 insertions(+), 30 deletions(-)
 create mode 100644 tools/testing/selftests/mount_setattr/.gitignore
 create mode 100644 tools/testing/selftests/mount_setattr/Makefile
 create mode 100644 tools/testing/selftests/mount_setattr/config
 create mode 100644 tools/testing/selftests/mount_setattr/mount_setattr_test.c


base-commit: dcb7fd82c75ee2d6e6f9d8cc71c52519ed52e258
-- 
2.27.0


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH] mount_setattr.2: New manual page documenting the mount_setattr() system call
  2020-07-14 16:14 [PATCH 0/4] fs: add mount_setattr() Christian Brauner
@ 2020-07-14 16:14 ` Christian Brauner
  2020-07-21  9:59   ` Michael Kerrisk (man-pages)
  2020-07-14 16:14 ` [PATCH 1/4] namespace: take lock_mount_hash() directly when changing flags Christian Brauner
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 13+ messages in thread
From: Christian Brauner @ 2020-07-14 16:14 UTC (permalink / raw)
  To: David Howells, Al Viro, linux-fsdevel
  Cc: linux-kernel, linux-api, Michael Kerrisk, Christian Brauner

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 man2/mount_setattr.2 | 296 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 296 insertions(+)
 create mode 100644 man2/mount_setattr.2

diff --git a/man2/mount_setattr.2 b/man2/mount_setattr.2
new file mode 100644
index 000000000..aae10525e
--- /dev/null
+++ b/man2/mount_setattr.2
@@ -0,0 +1,296 @@
+.\" Copyright (c) 2020 by Christian Brauner <christian.brauner@ubuntu.com>
+.\"
+.\" %%%LICENSE_START(VERBATIM)
+.\" Permission is granted to make and distribute verbatim copies of this
+.\" manual provided the copyright notice and this permission notice are
+.\" preserved on all copies.
+.\"
+.\" Permission is granted to copy and distribute modified versions of this
+.\" manual under the conditions for verbatim copying, provided that the
+.\" entire resulting derived work is distributed under the terms of a
+.\" permission notice identical to this one.
+.\"
+.\" Since the Linux kernel and libraries are constantly changing, this
+.\" manual page may be incorrect or out-of-date.  The author(s) assume no
+.\" responsibility for errors or omissions, or for damages resulting from
+.\" the use of the information contained herein.  The author(s) may not
+.\" have taken the same level of care in the production of this manual,
+.\" which is licensed free of charge, as they might when working
+.\" professionally.
+.\"
+.\" Formatted or processed versions of this manual, if unaccompanied by
+.\" the source, must acknowledge the copyright and authors of this work.
+.\" %%%LICENSE_END
+.\"
+.TH MOUNT_SETATTR 2 2020-07-14 "Linux" "Linux Programmer's Manual"
+.SH NAME
+mount_setattr \- change mount options of a mount or mount tree
+.SH SYNOPSIS
+.nf
+.BI "int mount_setattr(int " dfd ", const char *" path ", unsigned int " flags ,
+.BI "                  struct mount_attr *" attr ", size_t " size );
+.fi
+.PP
+.IR Note :
+There is no glibc wrapper for this system call; see NOTES.
+.SH DESCRIPTION
+The
+.BR mount_setattr ()
+system call changes the mount properties of a mount or whole mount tree.
+If
+.I path
+is a relative pathname, then it is interpreted relative to the directory
+referred to by the file descriptor
+.I dirfd
+(or the current working directory of the calling process, if
+.I dirfd
+is the special value
+.BR AT_FDCWD ).
+If
+.BR AT_EMPTY_PATH
+is specified in
+.I flags
+then the mount properties of the mount identified by
+.I dirfd
+are changed.
+.PP
+The
+.I flags
+argument can be used to alter the path resolution behavior. The supported
+values are:
+.TP
+.in +4n
+.B AT_EMPTY_PATH
+.in +4n
+The mount properties of the mount identified by
+.I dfd
+are changed.
+.TP
+.in +4n
+.B AT_RECURSIVE
+.in +4n
+Change the mount properties of the whole mount tree.
+.TP
+.in +4n
+.B AT_SYMLINK_NOFOLLOW
+.in +4n
+Don't follow trailing symlinks.
+.TP
+.in +4n
+.B AT_NO_AUTOMOUNT
+.in +4n
+Don't trigger automounts.
+.PP
+The
+.I attr
+argument of
+.BR mount_setattr ()
+is a structure of the following form:
+.PP
+.in +4n
+.EX
+struct mount_attr {
+    u64 attr_set;    /* Mount properties to set. */
+    u64 attr_clr;    /* Mount properties to clear. */
+    u32 propagation; /* Mount propagation type. */
+    u32 atime;       /* Access time settings. */
+};
+.EE
+.in
+.PP
+The
+.I attr_set
+and
+.I attr_clr
+members are used to specify the mount options that are supposed to be set or
+cleared for a given mount or mount tree. The following mount attributes can be
+specified in the
+.I attr_set
+and
+.I attr_clear
+fields:
+.TP
+.in +4n
+.B MOUNT_ATTR_RDONLY
+.in +4n
+If set in
+.I attr_set
+makes the mount read only and if set in
+.I attr_clr
+removes the read only setting if set on the mount.
+.TP
+.in +4n
+.B MOUNT_ATTR_NOSUID
+.in +4n
+If set in
+.I attr_set
+makes the mount not honor set-user-ID and set-group-ID bits or file capabilities
+when executing programs
+and if set in
+.I attr_clr
+clears the set-user-ID, set-group-ID bits, file capability restriction if set on
+this mount.
+.TP
+.in +4n
+.B MOUNT_ATTR_NODEV
+.in +4n
+If set in
+.I attr_set
+prevents access to devices on this mount
+and if set in
+.I attr_clr
+removes the device access restriction if set on this mount.
+.TP
+.in +4n
+.B MOUNT_ATTR_NOEXEC
+.in +4n
+If set in
+.I attr_set
+prevents executing programs on this mount
+and if set in
+.I attr_clr
+removes the restriction to execute programs on this mount.
+.TP
+.in +4n
+.B MOUNT_ATTR_NODIRATIME
+.in +4n
+If set in
+.I attr_set
+prevents updating access time for directories on this mount
+and if set in
+.I attr_clr
+removes access time restriction for directories. Note that
+.I MOUNT_ATTR_NODIRATIME
+can be combined with other access time settings and is implied
+by the noatime setting. All other access time settins are mutually
+exclusive.
+.PP
+The
+.I propagation
+member is used to specify the propagation type of the mount or mount tree.
+The supported mount propagation settings are:
+.TP
+.in +4n
+.B MAKE_PROPAGATION_PRIVATE
+.in +4n
+Turn all mounts into private mounts. Mount and umount events do not propagate
+into or out of this mount point.
+.TP
+.in +4n
+.B MAKE_PROPAGATION_SHARED
+.in +4n
+Turn all mounts into shared mounts. Mount points share events with members of a
+peer group. Mount and unmount events immediately under this mount point
+will propagate to the other mount points that are members of the peer group.
+Propagation here means that the same mount or unmount will automatically occur
+under all of the other mount points in the peer group. Conversely, mount and
+unmount events that take place under peer mount points will propagate to this
+mount point.
+.TP
+.in +4n
+.B MAKE_PROPAGATION_DEPENDENT
+.in +4n
+Turn all mounts into dependent mounts. Mount and unmount events propagate into
+this mount point from a shared  peer group. Mount and unmount events under this
+mount point do not propagate to any peer.
+.TP
+.in +4n
+.B MAKE_PROPAGATION_UNBINDABLE
+.in +4n
+This is like a private mount, and in addition this mount can't be bind mounted.
+Attempts to bind mount this mount will fail.
+When a recursive bind mount is performed on a directory subtree, any bind
+mounts within the subtree are automatically pruned (i.e., not replicated) when
+replicating that subtree to produce the target subtree.
+.PP
+The
+.I atime
+member is used to specify the access time behavior on a mount or mount tree.
+The supported access times settings are:
+.TP
+.in +4n
+.B MAKE_ATIME_RELATIVE
+.in +4n
+When a file on is accessed via this mount, update the file's last access time
+(atime) only if the current value of atime is less than or equal to the file's
+last modification time (mtime) or last status change time (ctime).
+.TP
+.in +4n
+.B MAKE_ATIME_NONE
+.in +4n
+Do not update access times for (all types of) files on this mount.
+.TP
+.in +4n
+.B MAKE_ATIME_STRICT
+.in +4n
+Always update the last access time (atime) when files are
+accessed on this mount.
+.PP
+The
+.I size
+argument that is supplied to
+.BR mount_setattr ()
+should be initialized to the size of this structure.
+(The existence of the
+.I size
+argument permits future extensions to the
+.IR mount_attr
+structure.)
+.SH RETURN VALUE
+On success,
+.BR mount_setattr ()
+zero is returned. On error, \-1 is returned and
+.I errno
+is set to indicate the cause of the error.
+.SH ERRORS
+.TP
+.B EBADF
+.I dfd
+is not a valid file descriptor.
+.TP
+.B ENOENT
+A pathname was empty or had a nonexistent component.
+.TP
+.B EINVAL
+Unsupported value in
+.I flags
+.TP
+.B EINVAL
+Unsupported value was specified in the
+.I attr_set
+field of
+.IR mount_attr.
+.TP
+.B EINVAL
+Unsupported value was specified in the
+.I attr_clr
+field of
+.IR mount_attr.
+.TP
+.B EINVAL
+Unsupported value was specified in the
+.I propagation
+field of
+.IR mount_attr.
+.TP
+.B EINVAL
+Unsupported value was specified in the
+.I atime
+field of
+.IR mount_attr.
+.TP
+.B EINVAL
+Caller tried to change the mount properties of a mount or mount tree
+in another mount namespace.
+.SH VERSIONS
+.BR mount_setattr ()
+first appeared in Linux ?.?.
+.\" commit ?
+.SH CONFORMING TO
+.BR mount_setattr ()
+is Linux specific.
+.SH NOTES
+Currently, there is no glibc wrapper for this system call; call it using
+.BR syscall (2).
+.SH SEE ALSO
+.BR mount (2),

base-commit: 28a4c58cc211900943f48d65fd42b313ce54e5a6
-- 
2.27.0


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/4] namespace: take lock_mount_hash() directly when changing flags
  2020-07-14 16:14 [PATCH 0/4] fs: add mount_setattr() Christian Brauner
  2020-07-14 16:14 ` [PATCH] mount_setattr.2: New manual page documenting the mount_setattr() system call Christian Brauner
@ 2020-07-14 16:14 ` Christian Brauner
  2020-07-14 16:49   ` Jann Horn
  2020-07-14 16:14 ` [PATCH 2/4] namespace: only take read lock in do_reconfigure_mnt() Christian Brauner
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 13+ messages in thread
From: Christian Brauner @ 2020-07-14 16:14 UTC (permalink / raw)
  To: David Howells, Al Viro, linux-fsdevel
  Cc: linux-kernel, linux-api, Michael Kerrisk, Christian Brauner

Changing mount options always ends up taking lock_mount_hash() but when
MNT_READONLY is requested and neither the mount nor the superblock are
not already MNT_READONLY we end up taking the lock, dropping it, and
retaking it to change the other mount attributes. Instead of this,
acquire the lock once when changing mount properties. This simplifies
the locking in these codepath, makes them easier to reason about and
avoids having to reacquire the lock right after dropping it.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 fs/namespace.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index f30ed401cc6d..395b9c912edf 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -463,7 +463,6 @@ static int mnt_make_readonly(struct mount *mnt)
 {
 	int ret = 0;
 
-	lock_mount_hash();
 	mnt->mnt.mnt_flags |= MNT_WRITE_HOLD;
 	/*
 	 * After storing MNT_WRITE_HOLD, we'll read the counters. This store
@@ -497,15 +496,12 @@ static int mnt_make_readonly(struct mount *mnt)
 	 */
 	smp_wmb();
 	mnt->mnt.mnt_flags &= ~MNT_WRITE_HOLD;
-	unlock_mount_hash();
 	return ret;
 }
 
 static int __mnt_unmake_readonly(struct mount *mnt)
 {
-	lock_mount_hash();
 	mnt->mnt.mnt_flags &= ~MNT_READONLY;
-	unlock_mount_hash();
 	return 0;
 }
 
@@ -2517,11 +2513,9 @@ static int change_mount_ro_state(struct mount *mnt, unsigned int mnt_flags)
  */
 static void set_mount_attributes(struct mount *mnt, unsigned int mnt_flags)
 {
-	lock_mount_hash();
 	mnt_flags |= mnt->mnt.mnt_flags & ~MNT_USER_SETTABLE_MASK;
 	mnt->mnt.mnt_flags = mnt_flags;
 	touch_mnt_namespace(mnt->mnt_ns);
-	unlock_mount_hash();
 }
 
 static void mnt_warn_timestamp_expiry(struct path *mountpoint, struct vfsmount *mnt)
@@ -2567,9 +2561,11 @@ static int do_reconfigure_mnt(struct path *path, unsigned int mnt_flags)
 		return -EPERM;
 
 	down_write(&sb->s_umount);
+	lock_mount_hash();
 	ret = change_mount_ro_state(mnt, mnt_flags);
 	if (ret == 0)
 		set_mount_attributes(mnt, mnt_flags);
+	unlock_mount_hash();
 	up_write(&sb->s_umount);
 
 	mnt_warn_timestamp_expiry(path, &mnt->mnt);
@@ -2609,8 +2605,11 @@ static int do_remount(struct path *path, int ms_flags, int sb_flags,
 		err = -EPERM;
 		if (ns_capable(sb->s_user_ns, CAP_SYS_ADMIN)) {
 			err = reconfigure_super(fc);
-			if (!err)
+			if (!err) {
+				lock_mount_hash();
 				set_mount_attributes(mnt, mnt_flags);
+				unlock_mount_hash();
+			}
 		}
 		up_write(&sb->s_umount);
 	}
-- 
2.27.0


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 2/4] namespace: only take read lock in do_reconfigure_mnt()
  2020-07-14 16:14 [PATCH 0/4] fs: add mount_setattr() Christian Brauner
  2020-07-14 16:14 ` [PATCH] mount_setattr.2: New manual page documenting the mount_setattr() system call Christian Brauner
  2020-07-14 16:14 ` [PATCH 1/4] namespace: take lock_mount_hash() directly when changing flags Christian Brauner
@ 2020-07-14 16:14 ` Christian Brauner
  2020-07-14 16:14 ` [PATCH 3/4] fs: add mount_setattr() Christian Brauner
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 13+ messages in thread
From: Christian Brauner @ 2020-07-14 16:14 UTC (permalink / raw)
  To: David Howells, Al Viro, linux-fsdevel
  Cc: linux-kernel, linux-api, Michael Kerrisk, Christian Brauner

do_reconfigure_mnt() used to take the down_write(&sb->s_umount) lock
which seems unnecessary since we're not changing the superblock. We're
only checking whether it is already read-only. Setting other mount
attributes is protected by lock_mount_hash() afaict and not by s_umount.

So I think the history of down_write(&sb->s_umount) lock being taken
when setting mount attributes dates back to the introduction of
MNT_READONLY in [2]. Afaict, this introduced the concept of having
read-only mounts in contrast to just having a read-only superblock. When
it got introduced it was simply plumbed into do_remount() which already
took down_write(&sb->s_umount) because it was only used to actually
change the superblock before [2]. Afaict, it would've already been
possible back then to only use down_read(&sb->s_umount) for
MS_BIND | MS_REMOUNT since actual mount options were protected by
the vfsmount lock already. But that would've meant special casing the
locking for MS_BIND | MS_REMOUNT in do_remount() which people might not
have considered worth it.
Then in [1] MS_BIND | MS_REMOUNT mount option changes were split out of
do_remount() into do_reconfigure_mnt() but the down_write(&sb->s_umount)
lock was simply copied over.
Now that we have this be a separate helper only take
the down_read(&sb->s_umount) lock since we're only interested in
checking whether the super block is currently read-only and blocking any
writers from changing it. Essentially, checking that the super block is
read-only has the advantage that we can avoid having to go into the
slowpath and through MNT_WRITE_HOLD and can simply set the read-only
flag on the mount in set_mount_attributes().

[1]: commit 43f5e655eff7 ("vfs: Separate changing mount flags full remount")
[2]: commit 2e4b7fcd9260 ("[PATCH] r/o bind mounts: honor mount writer counts at remount")
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 fs/namespace.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 395b9c912edf..ab025dd3be04 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2507,10 +2507,6 @@ static int change_mount_ro_state(struct mount *mnt, unsigned int mnt_flags)
 	return __mnt_unmake_readonly(mnt);
 }
 
-/*
- * Update the user-settable attributes on a mount.  The caller must hold
- * sb->s_umount for writing.
- */
 static void set_mount_attributes(struct mount *mnt, unsigned int mnt_flags)
 {
 	mnt_flags |= mnt->mnt.mnt_flags & ~MNT_USER_SETTABLE_MASK;
@@ -2560,13 +2556,17 @@ static int do_reconfigure_mnt(struct path *path, unsigned int mnt_flags)
 	if (!can_change_locked_flags(mnt, mnt_flags))
 		return -EPERM;
 
-	down_write(&sb->s_umount);
+	/*
+	 * We're only checking whether the superblock is read-only not changing
+	 * it, so only take down_read(&sb->s_umount).
+	 */
+	down_read(&sb->s_umount);
 	lock_mount_hash();
 	ret = change_mount_ro_state(mnt, mnt_flags);
 	if (ret == 0)
 		set_mount_attributes(mnt, mnt_flags);
 	unlock_mount_hash();
-	up_write(&sb->s_umount);
+	up_read(&sb->s_umount);
 
 	mnt_warn_timestamp_expiry(path, &mnt->mnt);
 
-- 
2.27.0


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 3/4] fs: add mount_setattr()
  2020-07-14 16:14 [PATCH 0/4] fs: add mount_setattr() Christian Brauner
                   ` (2 preceding siblings ...)
  2020-07-14 16:14 ` [PATCH 2/4] namespace: only take read lock in do_reconfigure_mnt() Christian Brauner
@ 2020-07-14 16:14 ` Christian Brauner
  2020-07-15  8:29   ` Geert Uytterhoeven
  2020-07-15 11:02   ` Christian Brauner
  2020-07-14 16:14 ` [PATCH 4/4] tests: add mount_setattr() selftests Christian Brauner
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 13+ messages in thread
From: Christian Brauner @ 2020-07-14 16:14 UTC (permalink / raw)
  To: David Howells, Al Viro, linux-fsdevel
  Cc: linux-kernel, linux-api, Michael Kerrisk, Christian Brauner,
	Aleksa Sarai

This implements the mount_setattr() syscall. While the new mount api
allows to change the properties of a superblock there is currently no
way to change the mount properties of a mount or mount tree using mount
file descriptors which the new mount api is based on. In addition the
old mount api has the restriction that mount options cannot be
applied recursively. This hasn't changed since changing mount options on
a per-mount basis was implemented in [1] and has been a frequent
request.
The legacy mount is currently unable to accommodate this behavior
without introducing a whole new set of flags because MS_REC | MS_REMOUNT
| MS_BIND | MS_RDONLY | MS_NOEXEC | [...] only apply the mount option to
the topmost mount. Changing MS_REC to apply to the whole mount tree
would mean introducing a significant uapi change and would likely cause
significant regressions.

The new mount_setattr() syscall allows to recursively clear and set
mount options in one shot. Multiple calls to change mount options
requesting the same changes are idempotent:

int mount_setattr(int dfd, const char *path, unsigned flags,
                  struct mount_attr *uattr, size_t usize);

Flags to modify path resolution behavior are specified in the @flags
argument. Currently, AT_EMPTY_PATH, AT_RECURSIVE, AT_SYMLINK_NOFOLLOW,
and AT_NO_AUTOMOUNT are supported. If useful, additional lookup flags to
restrict path resolution as introduced with openat2() might be supported
in the future.

mount_setattr() can be expected to grow over time and is designed with
extensibility in mind. It follows the extensible syscall pattern we have
used with other syscalls such as openat2(), clone3(),
sched_{set,get}attr(), and others.
The set of mount options is passed in the uapi struct mount_attr which
currently has the following layout:

struct mount_attr {
	__u64 attr_set;
	__u64 attr_clr;
	__u32 propagation;
	__u32 atime;

};

The @attr_set and @attr_clr members are used to clear and set mount
options. This way a user can e.g. request that a set of flags is to be
raised such as turning mounts readonly by raising MOUNT_ATTR_RDONLY in
@attr_set while at the same time requesting that another set of flags is
to be lowered such as removing noexec from a mount tree by specifying
MOUNT_ATTR_NOEXEC in @attr_clr.

The @propagation field lets callers specify the propagation type of a
mount tree. Propagation is a single property that has four different
settings and as such is not really a flag argument but an enum.
Specifically, it would be unclear what setting and clearing propagation
settings in combination would amount to. The legacy mount() syscall thus
forbids the combination of multiple propagation settings too. The goal
is to keep the semantics of mount propagation somewhat simple as they
are overly complex as it is.

Finally, struct mount_attr contains an @atime field which can be used to
set the atime behavior of a mount tree. Currently, access times are
already treated and defined like an enum in the new mount api so there's
no reason to treat them equivalent to a flag argument. A new atime enum
is introduced. The reason for not reusing the atime flags useable with
fsmount() and defined in the new mount api is that the
MOUNT_ATTR_RELATIME enum is defined as 0. This means, a user wanting to
transition to relative atime cannot simply specify MOUNT_ATTR_RELATIME
in @atime or @attr_set as this would mean not specifying any atime
settings is equivalent to specifying relative atime. This would cause
confusion for userspace as not specifying atime settings would switch
them to relatime. The new set of enums rectifies this by starting the
definition at 1 and letting 0 mean that atime settings are supposed to
be left unchanged.

Changing mount option has quite a few moving parts and the locking is
quite intricate so it is not unlikely that I got subtleties wrong.

[1]: commit 2e4b7fcd9260 ("[PATCH] r/o bind mounts: honor mount writer counts at remount")
Cc: David Howells <dhowells@redhat.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-api@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 arch/alpha/kernel/syscalls/syscall.tbl      |   1 +
 arch/arm/tools/syscall.tbl                  |   1 +
 arch/arm64/include/asm/unistd32.h           |   2 +
 arch/ia64/kernel/syscalls/syscall.tbl       |   1 +
 arch/m68k/kernel/syscalls/syscall.tbl       |   1 +
 arch/microblaze/kernel/syscalls/syscall.tbl |   1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   |   1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   |   1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   |   1 +
 arch/parisc/kernel/syscalls/syscall.tbl     |   1 +
 arch/powerpc/kernel/syscalls/syscall.tbl    |   1 +
 arch/s390/kernel/syscalls/syscall.tbl       |   1 +
 arch/sh/kernel/syscalls/syscall.tbl         |   1 +
 arch/sparc/kernel/syscalls/syscall.tbl      |   1 +
 arch/x86/entry/syscalls/syscall_32.tbl      |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl      |   1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     |   1 +
 fs/internal.h                               |   7 +
 fs/namespace.c                              | 275 ++++++++++++++++++--
 include/linux/syscalls.h                    |   3 +
 include/uapi/asm-generic/unistd.h           |   4 +-
 include/uapi/linux/mount.h                  |  31 +++
 22 files changed, 321 insertions(+), 17 deletions(-)

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 5ddd128d4b7a..6c5c0b7a1c9e 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -478,3 +478,4 @@
 547	common	openat2				sys_openat2
 548	common	pidfd_getfd			sys_pidfd_getfd
 549	common	faccessat2			sys_faccessat2
+550	common	mount_setattr			sys_mount_setattr
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index d5cae5ffede0..10014c157e3f 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -452,3 +452,4 @@
 437	common	openat2				sys_openat2
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	faccessat2			sys_faccessat2
+440	common	mount_setattr			sys_mount_setattr
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index 6d95d0c8bf2f..7de6051fa380 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -885,6 +885,8 @@ __SYSCALL(__NR_openat2, sys_openat2)
 __SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
 #define __NR_faccessat2 439
 __SYSCALL(__NR_faccessat2, sys_faccessat2)
+#define __NR_mount_setattr 440
+__SYSCALL(__NR_mount_setattr, sys_mount_setattr)
 
 /*
  * Please add new compat syscalls above this comment and update
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index 49e325b604b3..dd81c63f3970 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -359,3 +359,4 @@
 437	common	openat2				sys_openat2
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	faccessat2			sys_faccessat2
+440	common	mount_setattr			sys_mount_setattr
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index f71b1bbcc198..cb78cb4da7dd 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -438,3 +438,4 @@
 437	common	openat2				sys_openat2
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	faccessat2			sys_faccessat2
+440	common	mount_setattr			sys_mount_setattr
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index edacc4561f2b..71a5b24e2b67 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -444,3 +444,4 @@
 437	common	openat2				sys_openat2
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	faccessat2			sys_faccessat2
+440	common	mount_setattr			sys_mount_setattr
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index f777141f5256..9dcafeef6c07 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -377,3 +377,4 @@
 437	n32	openat2				sys_openat2
 438	n32	pidfd_getfd			sys_pidfd_getfd
 439	n32	faccessat2			sys_faccessat2
+440	n32	mount_setattr			sys_mount_setattr
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index da8c76394e17..5e51a29cc21f 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -353,3 +353,4 @@
 437	n64	openat2				sys_openat2
 438	n64	pidfd_getfd			sys_pidfd_getfd
 439	n64	faccessat2			sys_faccessat2
+440	n64	mount_setattr			sys_mount_setattr
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index 13280625d312..5b5fa22cca16 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -426,3 +426,4 @@
 437	o32	openat2				sys_openat2
 438	o32	pidfd_getfd			sys_pidfd_getfd
 439	o32	faccessat2			sys_faccessat2
+440	o32	mount_setattr			sys_mount_setattr
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index 5a758fa6ec52..e7fca7c8c407 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -436,3 +436,4 @@
 437	common	openat2				sys_openat2
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	faccessat2			sys_faccessat2
+440	common	mount_setattr			sys_mount_setattr
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index f833a3190822..cfb50e8c5d45 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -528,3 +528,4 @@
 437	common	openat2				sys_openat2
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	faccessat2			sys_faccessat2
+440	common	mount_setattr			sys_mount_setattr
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index bfdcb7633957..12081f161b30 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -441,3 +441,4 @@
 437  common	openat2			sys_openat2			sys_openat2
 438  common	pidfd_getfd		sys_pidfd_getfd			sys_pidfd_getfd
 439  common	faccessat2		sys_faccessat2			sys_faccessat2
+440  common	mount_setattr		sys_mount_setattr		sys_mount_setattr
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index acc35daa1b79..d4ffc9846ceb 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -441,3 +441,4 @@
 437	common	openat2				sys_openat2
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	faccessat2			sys_faccessat2
+440	common	mount_setattr			sys_mount_setattr
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index 8004a276cb74..024f010ee63e 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -484,3 +484,4 @@
 437	common	openat2			sys_openat2
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	faccessat2			sys_faccessat2
+440	common	mount_setattr			sys_mount_setattr
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index d8f8a1a69ed1..a89034dd8bc3 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -443,3 +443,4 @@
 437	i386	openat2			sys_openat2
 438	i386	pidfd_getfd		sys_pidfd_getfd
 439	i386	faccessat2		sys_faccessat2
+440	i386	mount_setattr		sys_mount_setattr
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 78847b32e137..c23771eeb8df 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -360,6 +360,7 @@
 437	common	openat2			sys_openat2
 438	common	pidfd_getfd		sys_pidfd_getfd
 439	common	faccessat2		sys_faccessat2
+440	common	mount_setattr		sys_mount_setattr
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index 69d0d73876b3..df0dfb1611f4 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -409,3 +409,4 @@
 437	common	openat2				sys_openat2
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	faccessat2			sys_faccessat2
+440	common	mount_setattr			sys_mount_setattr
diff --git a/fs/internal.h b/fs/internal.h
index 9b863a7bd708..62f7526d7536 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -75,6 +75,13 @@ int do_linkat(int olddfd, const char __user *oldname, int newdfd,
 /*
  * namespace.c
  */
+struct mount_kattr {
+	unsigned int attr_set;
+	unsigned int attr_clr;
+	unsigned int propagation;
+	unsigned int lookup_flags;
+	bool recurse;
+};
 extern void *copy_mount_options(const void __user *);
 extern char *copy_mount_string(const void __user *);
 
diff --git a/fs/namespace.c b/fs/namespace.c
index ab025dd3be04..13e29fcc82ab 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -459,10 +459,8 @@ void mnt_drop_write_file(struct file *file)
 }
 EXPORT_SYMBOL(mnt_drop_write_file);
 
-static int mnt_make_readonly(struct mount *mnt)
+static inline int mnt_hold_writers(struct mount *mnt)
 {
-	int ret = 0;
-
 	mnt->mnt.mnt_flags |= MNT_WRITE_HOLD;
 	/*
 	 * After storing MNT_WRITE_HOLD, we'll read the counters. This store
@@ -487,15 +485,30 @@ static int mnt_make_readonly(struct mount *mnt)
 	 * we're counting up here.
 	 */
 	if (mnt_get_writers(mnt) > 0)
-		ret = -EBUSY;
-	else
-		mnt->mnt.mnt_flags |= MNT_READONLY;
+		return -EBUSY;
+
+	return 0;
+}
+
+static inline void mnt_unhold_writers(struct mount *mnt)
+{
 	/*
 	 * MNT_READONLY must become visible before ~MNT_WRITE_HOLD, so writers
 	 * that become unheld will see MNT_READONLY.
 	 */
 	smp_wmb();
 	mnt->mnt.mnt_flags &= ~MNT_WRITE_HOLD;
+}
+
+static int mnt_make_readonly(struct mount *mnt)
+{
+	int ret;
+
+	ret = mnt_hold_writers(mnt);
+	if (ret)
+		return ret;
+	mnt->mnt.mnt_flags |= MNT_READONLY;
+	mnt_unhold_writers(mnt);
 	return ret;
 }
 
@@ -3416,6 +3429,25 @@ SYSCALL_DEFINE5(mount, char __user *, dev_name, char __user *, dir_name,
 	return ret;
 }
 
+static int build_attr_flags(unsigned int attr_flags, unsigned int *flags)
+{
+	unsigned int aflags = 0;
+
+	if (attr_flags & MOUNT_ATTR_RDONLY)
+		aflags |= MNT_READONLY;
+	if (attr_flags & MOUNT_ATTR_NOSUID)
+		aflags |= MNT_NOSUID;
+	if (attr_flags & MOUNT_ATTR_NODEV)
+		aflags |= MNT_NODEV;
+	if (attr_flags & MOUNT_ATTR_NOEXEC)
+		aflags |= MNT_NOEXEC;
+	if (attr_flags & MOUNT_ATTR_NODIRATIME)
+		aflags |= MNT_NODIRATIME;
+
+	*flags = aflags;
+	return 0;
+}
+
 /*
  * Create a kernel mount representation for a new, prepared superblock
  * (specified by fs_fd) and attach to an open_tree-like file descriptor.
@@ -3446,16 +3478,9 @@ SYSCALL_DEFINE3(fsmount, int, fs_fd, unsigned int, flags,
 			   MOUNT_ATTR_NODIRATIME))
 		return -EINVAL;
 
-	if (attr_flags & MOUNT_ATTR_RDONLY)
-		mnt_flags |= MNT_READONLY;
-	if (attr_flags & MOUNT_ATTR_NOSUID)
-		mnt_flags |= MNT_NOSUID;
-	if (attr_flags & MOUNT_ATTR_NODEV)
-		mnt_flags |= MNT_NODEV;
-	if (attr_flags & MOUNT_ATTR_NOEXEC)
-		mnt_flags |= MNT_NOEXEC;
-	if (attr_flags & MOUNT_ATTR_NODIRATIME)
-		mnt_flags |= MNT_NODIRATIME;
+	ret = build_attr_flags(attr_flags, &mnt_flags);
+	if (ret)
+		return ret;
 
 	switch (attr_flags & MOUNT_ATTR__ATIME) {
 	case MOUNT_ATTR_STRICTATIME:
@@ -3763,6 +3788,224 @@ SYSCALL_DEFINE2(pivot_root, const char __user *, new_root,
 	return error;
 }
 
+static int do_mount_setattr(struct path *path, struct mount_kattr *kattr)
+{
+	struct mount *mnt = real_mount(path->mnt), *m = mnt, *last = NULL;
+	unsigned int all_raised = kattr->attr_set | kattr->attr_clr;
+	bool rdonly_set = kattr->attr_set & MNT_READONLY;
+	int err = 0;
+
+	if (!check_mnt(m))
+		return -EINVAL;
+
+	if (path->dentry != m->mnt.mnt_root)
+		return -EINVAL;
+
+	if (kattr->propagation) {
+		/* Only take namespace_lock() if we're actually changing propagation. */
+		namespace_lock();
+		if (kattr->propagation == MS_SHARED) {
+			err = invent_group_ids(m, kattr->recurse);
+			if (err) {
+				namespace_unlock();
+				return err;
+			}
+		}
+	}
+
+	lock_mount_hash();
+	/*
+	 * Get the mount tree in a shape where we can change mount properties
+	 * without failure.
+	 */
+	m = mnt;
+	do {
+		last = m;
+
+		if (!can_change_locked_flags(m, all_raised)) {
+			err = -EPERM;
+			break;
+		}
+
+		if (rdonly_set && !(m->mnt.mnt_flags & MNT_READONLY)) {
+			err = mnt_hold_writers(m);
+			if (err)
+				break;
+		}
+	} while (kattr->recurse && (m = next_mnt(m, mnt)));
+
+	m = mnt;
+	do {
+		if (!err) {
+			unsigned int new_flags;
+
+			new_flags = m->mnt.mnt_flags;
+			if (kattr->attr_set & MNT_RELATIME) {
+				new_flags &= ~MNT_NOATIME;
+				new_flags |= MNT_RELATIME;
+			}
+			if (kattr->attr_set & MNT_NOATIME) {
+				new_flags &= ~MNT_RELATIME;
+				new_flags |= MNT_NOATIME;
+			}
+			/* Lower flags user wants us to clear. */
+			new_flags &= ~kattr->attr_clr;
+			/* Raise flags user wants us to set. */
+			new_flags |= kattr->attr_set;
+			WRITE_ONCE(m->mnt.mnt_flags, new_flags);
+		}
+
+		/*
+		 * We either set MNT_READONLY above so make it visible
+		 * before ~MNT_WRITE_HOLD or we failed to recursively
+		 * apply mount options.
+		 */
+		if (rdonly_set && (m->mnt.mnt_flags & MNT_WRITE_HOLD))
+			mnt_unhold_writers(m);
+
+		if (!err && kattr->propagation)
+			change_mnt_propagation(m, kattr->propagation);
+
+		/*
+		 * On failure, only cleanup until we found the first mount we
+		 * failed to handle.
+		 */
+		if (err && m == last)
+			break;
+	} while (kattr->recurse && (m = next_mnt(m, mnt)));
+
+	if (!err)
+		touch_mnt_namespace(mnt->mnt_ns);
+
+	unlock_mount_hash();
+
+	if (kattr->propagation) {
+		namespace_unlock();
+		if (err)
+			cleanup_group_ids(mnt, NULL);
+	}
+
+	return err;
+}
+
+static int build_mount_kattr(const struct mount_attr *attr,
+			     struct mount_kattr *kattr, unsigned int flags)
+{
+	unsigned int lookup_flags = LOOKUP_AUTOMOUNT | LOOKUP_FOLLOW;
+
+	if (flags & AT_NO_AUTOMOUNT)
+		lookup_flags &= ~LOOKUP_AUTOMOUNT;
+	if (flags & AT_SYMLINK_NOFOLLOW)
+		lookup_flags &= ~LOOKUP_FOLLOW;
+	if (flags & AT_EMPTY_PATH)
+		lookup_flags |= LOOKUP_EMPTY;
+
+	*kattr = (struct mount_kattr){
+		.lookup_flags	= lookup_flags,
+		.recurse	= !!(flags & AT_RECURSIVE),
+	};
+
+	switch (attr->propagation) {
+	case MAKE_PROPAGATION_UNCHANGED:
+		kattr->propagation = 0;
+		break;
+	case MAKE_PROPAGATION_UNBINDABLE:
+		kattr->propagation = MS_UNBINDABLE;
+		break;
+	case MAKE_PROPAGATION_PRIVATE:
+		kattr->propagation = MS_PRIVATE;
+		break;
+	case MAKE_PROPAGATION_DEPENDENT:
+		kattr->propagation = MS_SLAVE;
+		break;
+	case MAKE_PROPAGATION_SHARED:
+		kattr->propagation = MS_SHARED;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+
+	if (attr->attr_set) {
+		if (attr->attr_set &
+		    ~(MOUNT_ATTR_RDONLY | MOUNT_ATTR_NOSUID | MOUNT_ATTR_NODEV |
+		      MOUNT_ATTR_NOEXEC | MOUNT_ATTR_NODIRATIME))
+			return -EINVAL;
+
+		if (build_attr_flags(lower_32_bits(attr->attr_set), &kattr->attr_set))
+			return -EINVAL;
+	}
+
+	if (attr->attr_clr) {
+		if (attr->attr_clr &
+		    ~(MOUNT_ATTR_RDONLY | MOUNT_ATTR_NOSUID | MOUNT_ATTR_NODEV |
+		      MOUNT_ATTR_NOEXEC | MOUNT_ATTR_NODIRATIME))
+			return -EINVAL;
+
+		if (build_attr_flags(lower_32_bits(attr->attr_clr), &kattr->attr_clr))
+			return -EINVAL;
+	}
+
+	switch (attr->atime) {
+	case MAKE_ATIME_UNCHANGED:
+		break;
+	case MAKE_ATIME_RELATIVE:
+		kattr->attr_set |= MNT_RELATIME;
+		break;
+	case MAKE_ATIME_NONE:
+		kattr->attr_set |= MNT_NOATIME;
+		break;
+	case MAKE_ATIME_STRICT:
+		kattr->attr_clr |= MNT_RELATIME | MNT_NOATIME;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+SYSCALL_DEFINE5(mount_setattr, int, dfd, const char __user *, path, unsigned int, flags,
+		struct mount_attr __user *, uattr, size_t, usize)
+{
+	int err;
+	struct path target;
+	struct mount_attr attr;
+	struct mount_kattr kattr;
+
+	BUILD_BUG_ON(sizeof(struct mount_attr) < MOUNT_ATTR_SIZE_VER0);
+	BUILD_BUG_ON(sizeof(struct mount_attr) != MOUNT_ATTR_SIZE_LATEST);
+
+	if (flags & ~(AT_EMPTY_PATH | AT_RECURSIVE | AT_SYMLINK_NOFOLLOW | AT_NO_AUTOMOUNT))
+		return -EINVAL;
+
+	if (unlikely(usize < MOUNT_ATTR_SIZE_VER0))
+		return -EINVAL;
+
+	if (!may_mount())
+		return -EPERM;
+
+	err = copy_struct_from_user(&attr, sizeof(attr), uattr, usize);
+	if (err)
+		return err;
+
+	if (attr.attr_set == 0 && attr.attr_clr == 0 && attr.propagation == 0 &&
+	    attr.atime == 0)
+		return 0;
+
+	err = build_mount_kattr(&attr, &kattr, flags);
+	if (err)
+		return err;
+
+	err = user_path_at(dfd, path, kattr.lookup_flags, &target);
+	if (err)
+		return err;
+
+	err = do_mount_setattr(&target, &kattr);
+	path_put(&target);
+	return err;
+}
+
 static void __init init_mount_tree(void)
 {
 	struct vfsmount *mnt;
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index b951a87da987..f3e7b21980b1 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -69,6 +69,7 @@ union bpf_attr;
 struct io_uring_params;
 struct clone_args;
 struct open_how;
+struct mount_attr;
 
 #include <linux/types.h>
 #include <linux/aio_abi.h>
@@ -996,6 +997,8 @@ asmlinkage long sys_open_tree(int dfd, const char __user *path, unsigned flags);
 asmlinkage long sys_move_mount(int from_dfd, const char __user *from_path,
 			       int to_dfd, const char __user *to_path,
 			       unsigned int ms_flags);
+asmlinkage long sys_mount_setattr(int dfd, const char __user *path, unsigned int flags,
+				  struct mount_attr __user *uattr, size_t usize);
 asmlinkage long sys_fsopen(const char __user *fs_name, unsigned int flags);
 asmlinkage long sys_fsconfig(int fs_fd, unsigned int cmd, const char __user *key,
 			     const void __user *value, int aux);
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index f4a01305d9a6..c65644574038 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -857,9 +857,11 @@ __SYSCALL(__NR_openat2, sys_openat2)
 __SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
 #define __NR_faccessat2 439
 __SYSCALL(__NR_faccessat2, sys_faccessat2)
+#define __NR_mount_setattr 440
+__SYSCALL(__NR_mount_setattr, sys_mount_setattr)
 
 #undef __NR_syscalls
-#define __NR_syscalls 440
+#define __NR_syscalls 441
 
 /*
  * 32 bit systems traditionally used different
diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
index 96a0240f23fe..7582d545738b 100644
--- a/include/uapi/linux/mount.h
+++ b/include/uapi/linux/mount.h
@@ -117,4 +117,35 @@ enum fsconfig_command {
 #define MOUNT_ATTR_STRICTATIME	0x00000020 /* - Always perform atime updates */
 #define MOUNT_ATTR_NODIRATIME	0x00000080 /* Do not update directory access times */
 
+/*
+ * mount_setattr()
+ */
+struct mount_attr {
+	__u64 attr_set;
+	__u64 attr_clr;
+	__u32 propagation;
+	__u32 atime;
+};
+
+/* Change propagation through mount_setattr(). */
+enum propagation_type {
+	MAKE_PROPAGATION_UNCHANGED	= 0, /* Don't change mount propagation (default). */
+	MAKE_PROPAGATION_UNBINDABLE	= 1, /* Make unbindable. */
+	MAKE_PROPAGATION_PRIVATE	= 2, /* Do not receive or send mount events. */
+	MAKE_PROPAGATION_DEPENDENT	= 3, /* Only receive mount events. */
+	MAKE_PROPAGATION_SHARED		= 4, /* Send and receive mount events. */
+};
+
+/* Change atime settings through mount_setattr() */
+enum atime_type {
+	MAKE_ATIME_UNCHANGED	= 0, /* Don't change atime settings. */
+	MAKE_ATIME_RELATIVE	= 1, /* Update atime relative to mtime/ctime. */
+	MAKE_ATIME_NONE		= 2, /* Do not update access times. */
+	MAKE_ATIME_STRICT	= 3, /* Always perform atime updates. */
+};
+
+/* List of all mount_attr versions. */
+#define MOUNT_ATTR_SIZE_VER0	24 /* sizeof first published struct */
+#define MOUNT_ATTR_SIZE_LATEST	MOUNT_ATTR_SIZE_VER0
+
 #endif /* _UAPI_LINUX_MOUNT_H */
-- 
2.27.0


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 4/4] tests: add mount_setattr() selftests
  2020-07-14 16:14 [PATCH 0/4] fs: add mount_setattr() Christian Brauner
                   ` (3 preceding siblings ...)
  2020-07-14 16:14 ` [PATCH 3/4] fs: add mount_setattr() Christian Brauner
@ 2020-07-14 16:14 ` Christian Brauner
  2020-07-19 17:10 ` [PATCH 0/4] fs: add mount_setattr() Al Viro
  2020-10-01 14:17 ` Pavel Tikhomirov
  6 siblings, 0 replies; 13+ messages in thread
From: Christian Brauner @ 2020-07-14 16:14 UTC (permalink / raw)
  To: David Howells, Al Viro, linux-fsdevel
  Cc: linux-kernel, linux-api, Michael Kerrisk, Christian Brauner

Add a range of selftests for the new mount_setattr() syscall to verify
that it works as expected. This tests that:
- no invalid flags can be specified
- changing properties of a single mount works and leaves other mounts in
  the mount tree unchanged
- changing a mount tre to read-only when one of the mounts has writers
  fails and leaves the whole mount tree unchanged
- changing mount properties from multiple threads works
- changing atime settings works
- changing mount propagation works
- changing the mount options of a mount tree where the individual mounts
  in the tree have different mount options only changes the flags that
  were requested to change
- changing mount options from another mount namespace fails
- changing mount options from another user namespace fails

[==========] Running 9 tests from 2 test cases.
[ RUN      ] mount_setattr.invalid_attributes
[       OK ] mount_setattr.invalid_attributes
[ RUN      ] mount_setattr.basic
[       OK ] mount_setattr.basic
[ RUN      ] mount_setattr.basic_recursive
[       OK ] mount_setattr.basic_recursive
[ RUN      ] mount_setattr.mount_has_writers
[       OK ] mount_setattr.mount_has_writers
[ RUN      ] mount_setattr.mixed_mount_options
[       OK ] mount_setattr.mixed_mount_options
[ RUN      ] mount_setattr.time_changes
[       OK ] mount_setattr.time_changes
[ RUN      ] mount_setattr.multi_threaded
[       OK ] mount_setattr.multi_threaded
[ RUN      ] mount_setattr.wrong_user_namespace
[       OK ] mount_setattr.wrong_user_namespace
[ RUN      ] mount_setattr.wrong_mount_namespace
[       OK ] mount_setattr.wrong_mount_namespace
[==========] 9 / 9 tests passed.
[  PASSED  ]

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 tools/testing/selftests/Makefile              |   1 +
 .../selftests/mount_setattr/.gitignore        |   1 +
 .../testing/selftests/mount_setattr/Makefile  |   7 +
 tools/testing/selftests/mount_setattr/config  |   1 +
 .../mount_setattr/mount_setattr_test.c        | 802 ++++++++++++++++++
 5 files changed, 812 insertions(+)
 create mode 100644 tools/testing/selftests/mount_setattr/.gitignore
 create mode 100644 tools/testing/selftests/mount_setattr/Makefile
 create mode 100644 tools/testing/selftests/mount_setattr/config
 create mode 100644 tools/testing/selftests/mount_setattr/mount_setattr_test.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 1195bd85af38..6620c16799a3 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -31,6 +31,7 @@ TARGETS += membarrier
 TARGETS += memfd
 TARGETS += memory-hotplug
 TARGETS += mount
+TARGETS += mount_setattr
 TARGETS += mqueue
 TARGETS += net
 TARGETS += net/forwarding
diff --git a/tools/testing/selftests/mount_setattr/.gitignore b/tools/testing/selftests/mount_setattr/.gitignore
new file mode 100644
index 000000000000..5f74d8488472
--- /dev/null
+++ b/tools/testing/selftests/mount_setattr/.gitignore
@@ -0,0 +1 @@
+mount_setattr_test
diff --git a/tools/testing/selftests/mount_setattr/Makefile b/tools/testing/selftests/mount_setattr/Makefile
new file mode 100644
index 000000000000..2250f7dcb81e
--- /dev/null
+++ b/tools/testing/selftests/mount_setattr/Makefile
@@ -0,0 +1,7 @@
+# SPDX-License-Identifier: GPL-2.0
+# Makefile for mount selftests.
+CFLAGS = -g -I../../../../usr/include/ -Wall -O2 -pthread
+
+TEST_GEN_FILES += mount_setattr_test
+
+include ../lib.mk
diff --git a/tools/testing/selftests/mount_setattr/config b/tools/testing/selftests/mount_setattr/config
new file mode 100644
index 000000000000..416bd53ce982
--- /dev/null
+++ b/tools/testing/selftests/mount_setattr/config
@@ -0,0 +1 @@
+CONFIG_USER_NS=y
diff --git a/tools/testing/selftests/mount_setattr/mount_setattr_test.c b/tools/testing/selftests/mount_setattr/mount_setattr_test.c
new file mode 100644
index 000000000000..d215470cc005
--- /dev/null
+++ b/tools/testing/selftests/mount_setattr/mount_setattr_test.c
@@ -0,0 +1,802 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include <sched.h>
+#include <stdio.h>
+#include <errno.h>
+#include <pthread.h>
+#include <string.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <sys/mount.h>
+#include <sys/wait.h>
+#include <sys/vfs.h>
+#include <sys/statvfs.h>
+#include <sys/sysinfo.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <fcntl.h>
+#include <grp.h>
+#include <stdbool.h>
+#include <stdarg.h>
+
+#include "../kselftest_harness.h"
+
+#ifndef CLONE_NEWNS
+#define CLONE_NEWNS 0x00020000
+#endif
+
+#ifndef CLONE_NEWUSER
+#define CLONE_NEWUSER 0x10000000
+#endif
+
+#ifndef MS_REC
+#define MS_REC 16384
+#endif
+
+#ifndef MS_RELATIME
+#define MS_RELATIME (1 << 21)
+#endif
+
+#ifndef MS_STRICTATIME
+#define MS_STRICTATIME (1 << 24)
+#endif
+
+#ifndef MOUNT_ATTR_RDONLY
+#define MOUNT_ATTR_RDONLY 0x00000001
+#endif
+
+#ifndef MOUNT_ATTR_NOSUID
+#define MOUNT_ATTR_NOSUID 0x00000002
+#endif
+
+#ifndef MOUNT_ATTR_NOEXEC
+#define MOUNT_ATTR_NOEXEC 0x00000008
+#endif
+
+#ifndef MOUNT_ATTR_NODIRATIME
+#define MOUNT_ATTR_NODIRATIME 0x00000080
+#endif
+
+#ifndef MAKE_ATIME_UNCHANGED
+#define MAKE_ATIME_UNCHANGED 0
+#endif
+
+#ifndef MAKE_ATIME_RELATIVE
+#define MAKE_ATIME_RELATIVE 1
+#endif
+
+#ifndef MAKE_ATIME_NONE
+#define MAKE_ATIME_NONE 2
+#endif
+
+#ifndef MAKE_ATIME_STRICT
+#define MAKE_ATIME_STRICT 3
+#endif
+
+#ifndef AT_RECURSIVE
+#define AT_RECURSIVE 0x8000
+#endif
+
+#ifndef MAKE_PROPAGATION_SHARED
+#define MAKE_PROPAGATION_SHARED 4
+#endif
+
+#define DEFAULT_THREADS 4
+#define ptr_to_int(p) ((int)((intptr_t)(p)))
+#define int_to_ptr(u) ((void *)((intptr_t)(u)))
+
+#ifndef __NR_mount_setattr
+	#if defined __alpha__
+		#define __NR_mount_setattr 550
+	#elif defined _MIPS_SIM
+		#if _MIPS_SIM == _MIPS_SIM_ABI32	/* o32 */
+			#define __NR_mount_setattr 4440
+		#endif
+		#if _MIPS_SIM == _MIPS_SIM_NABI32	/* n32 */
+			#define __NR_mount_setattr 6440
+		#endif
+		#if _MIPS_SIM == _MIPS_SIM_ABI64	/* n64 */
+			#define __NR_mount_setattr 5440
+		#endif
+	#elif defined __ia64__
+		#define __NR_mount_setattr (440 + 1024)
+	#else
+		#define __NR_mount_setattr 440
+#endif
+
+struct mount_attr {
+	__u64 attr_set;
+	__u64 attr_clr;
+	__u32 propagation;
+	__u32 atime;
+};
+#endif
+
+static inline int sys_mount_setattr(int dfd, const char *path, unsigned int flags,
+				    struct mount_attr *attr, size_t size)
+{
+	return syscall(__NR_mount_setattr, dfd, path, flags, attr, size);
+}
+
+static ssize_t write_nointr(int fd, const void *buf, size_t count)
+{
+	ssize_t ret;
+
+	do {
+		ret = write(fd, buf, count);
+	} while (ret < 0 && errno == EINTR);
+
+	return ret;
+}
+
+static int write_file(const char *path, const void *buf, size_t count)
+{
+	int fd;
+	ssize_t ret;
+
+	fd = open(path, O_WRONLY | O_CLOEXEC | O_NOCTTY | O_NOFOLLOW);
+	if (fd < 0)
+		return -1;
+
+	ret = write_nointr(fd, buf, count);
+	close(fd);
+	if (ret < 0 || (size_t)ret != count)
+		return -1;
+
+	return 0;
+}
+
+static int create_and_enter_userns(void)
+{
+	uid_t uid;
+	gid_t gid;
+	char map[100];
+
+	uid = getuid();
+	gid = getgid();
+
+	if (unshare(CLONE_NEWUSER))
+		return -1;
+
+	if (write_file("/proc/self/setgroups", "deny", sizeof("deny") - 1) &&
+	    errno != ENOENT)
+		return -1;
+
+	snprintf(map, sizeof(map), "0 %d 1", uid);
+	if (write_file("/proc/self/uid_map", map, strlen(map)))
+		return -1;
+
+
+	snprintf(map, sizeof(map), "0 %d 1", gid);
+	if (write_file("/proc/self/gid_map", map, strlen(map)))
+		return -1;
+
+	if (setgid(0))
+		return -1;
+
+	if (setuid(0))
+		return -1;
+
+	return 0;
+}
+
+static int prepare_unpriv_mountns(void)
+{
+	if (create_and_enter_userns())
+		return -1;
+
+	if (unshare(CLONE_NEWNS))
+		return -1;
+
+	if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, 0))
+		return -1;
+
+	return 0;
+}
+
+static int read_mnt_flags(const char *path)
+{
+	int ret;
+	struct statvfs stat;
+	unsigned int mnt_flags;
+
+	ret = statvfs(path, &stat);
+	if (ret != 0)
+		return -EINVAL;
+
+	if (stat.f_flag &
+	    ~(ST_RDONLY | ST_NOSUID | ST_NODEV | ST_NOEXEC | ST_NOATIME |
+	      ST_NODIRATIME | ST_RELATIME | ST_SYNCHRONOUS | ST_MANDLOCK))
+		return -EINVAL;
+
+	mnt_flags = 0;
+	if (stat.f_flag & ST_RDONLY)
+		mnt_flags |= MS_RDONLY;
+	if (stat.f_flag & ST_NOSUID)
+		mnt_flags |= MS_NOSUID;
+	if (stat.f_flag & ST_NODEV)
+		mnt_flags |= MS_NODEV;
+	if (stat.f_flag & ST_NOEXEC)
+		mnt_flags |= MS_NOEXEC;
+	if (stat.f_flag & ST_NOATIME)
+		mnt_flags |= MS_NOATIME;
+	if (stat.f_flag & ST_NODIRATIME)
+		mnt_flags |= MS_NODIRATIME;
+	if (stat.f_flag & ST_RELATIME)
+		mnt_flags |= MS_RELATIME;
+	if (stat.f_flag & ST_SYNCHRONOUS)
+		mnt_flags |= MS_SYNCHRONOUS;
+	if (stat.f_flag & ST_MANDLOCK)
+		mnt_flags |= ST_MANDLOCK;
+
+	return mnt_flags;
+}
+
+static char *get_field(char *src, int nfields)
+{
+	int i;
+	char *p = src;
+
+	for (i = 0; i < nfields; i++) {
+		while (*p && *p != ' ' && *p != '\t')
+			p++;
+
+		if (!*p)
+			break;
+
+		p++;
+	}
+
+	return p;
+}
+
+static void null_endofword(char *word)
+{
+	while (*word && *word != ' ' && *word != '\t')
+		word++;
+	*word = '\0';
+}
+
+static bool is_shared_mount(const char *path)
+{
+	size_t len = 0;
+	char *line = NULL;
+	FILE *f = NULL;
+
+	f = fopen("/proc/self/mountinfo", "re");
+	if (!f)
+		return false;
+
+	while (getline(&line, &len, f) != -1) {
+		char *opts, *target;
+
+		target = get_field(line, 4);
+		if (!target)
+			continue;
+
+		opts = get_field(target, 2);
+		if (!opts)
+			continue;
+
+		null_endofword(target);
+
+		if (strcmp(target, path) != 0)
+			continue;
+
+		null_endofword(opts);
+		if (strstr(opts, "shared:"))
+			return true;
+	}
+
+	free(line);
+	fclose(f);
+
+	return false;
+}
+
+static void *mount_setattr_thread(void *data)
+{
+	struct mount_attr attr = {
+		.attr_set	= MOUNT_ATTR_RDONLY | MOUNT_ATTR_NOSUID,
+		.attr_clr	= 0,
+		.propagation	= MAKE_PROPAGATION_SHARED,
+	};
+
+	if (sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &attr, sizeof(attr)))
+		pthread_exit(int_to_ptr(-1));
+
+	pthread_exit(int_to_ptr(0));
+}
+
+FIXTURE(mount_setattr) {
+};
+
+FIXTURE_SETUP(mount_setattr)
+{
+	EXPECT_EQ(prepare_unpriv_mountns(), 0);
+
+	(void)umount2("/mnt", MNT_DETACH);
+	(void)umount2("/tmp", MNT_DETACH);
+
+	EXPECT_EQ(mount("testing", "/tmp", "tmpfs", MS_NOATIME | MS_NODEV,
+			"size=100000,mode=700"), 0);
+
+	EXPECT_EQ(mkdir("/tmp/B", 0777), 0);
+
+	EXPECT_EQ(mount("testing", "/tmp/B", "tmpfs", MS_NOATIME | MS_NODEV,
+			"size=100000,mode=700"), 0);
+
+	EXPECT_EQ(mkdir("/tmp/B/BB", 0777), 0);
+
+	EXPECT_EQ(mount("testing", "/tmp/B/BB", "tmpfs", MS_NOATIME | MS_NODEV,
+			"size=100000,mode=700"), 0);
+
+	EXPECT_EQ(mount("testing", "/mnt", "tmpfs", MS_NOATIME | MS_NODEV,
+			"size=100000,mode=700"), 0);
+
+	EXPECT_EQ(mkdir("/mnt/A", 0777), 0);
+
+	EXPECT_EQ(mount("testing", "/mnt/A", "tmpfs", MS_NOATIME | MS_NODEV,
+			"size=100000,mode=700"), 0);
+
+	EXPECT_EQ(mkdir("/mnt/A/AA", 0777), 0);
+
+	EXPECT_EQ(mount("/tmp", "/mnt/A/AA", NULL, MS_BIND | MS_REC, NULL), 0);
+
+	EXPECT_EQ(mkdir("/mnt/B", 0777), 0);
+
+	EXPECT_EQ(mount("testing", "/mnt/B", "ramfs",
+			MS_NOATIME | MS_NODEV | MS_NOSUID, 0), 0);
+
+	EXPECT_EQ(mkdir("/mnt/B/BB", 0777), 0);
+
+	EXPECT_EQ(mount("testing", "/tmp/B/BB", "devpts",
+			MS_RELATIME | MS_NOEXEC | MS_RDONLY, 0), 0);
+}
+
+FIXTURE_TEARDOWN(mount_setattr)
+{
+	(void)umount2("/mnt/A", MNT_DETACH);
+	(void)umount2("/tmp", MNT_DETACH);
+}
+
+TEST_F(mount_setattr, invalid_attributes)
+{
+	struct mount_attr invalid_attr = {
+		.attr_set = (1U << 31),
+	};
+
+	EXPECT_LT(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &invalid_attr,
+				    sizeof(invalid_attr)), 0);
+
+	invalid_attr.attr_set	= 0;
+	invalid_attr.attr_clr	= (1U << 31);
+	EXPECT_LT(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &invalid_attr,
+				    sizeof(invalid_attr)), 0);
+
+	invalid_attr.attr_clr		= 0;
+	invalid_attr.propagation	= (1U << 31);
+	EXPECT_LT(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &invalid_attr,
+				    sizeof(invalid_attr)), 0);
+
+	invalid_attr.attr_clr		= 0;
+	invalid_attr.propagation	= 0;
+	invalid_attr.atime		= (1U << 31);
+	EXPECT_LT(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &invalid_attr,
+				    sizeof(invalid_attr)), 0);
+
+	invalid_attr.attr_set		= (1U << 31);
+	invalid_attr.attr_clr		= (1U << 31);
+	invalid_attr.propagation	= (1U << 31);
+	invalid_attr.atime		= (1U << 31);
+	EXPECT_LT(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &invalid_attr,
+				    sizeof(invalid_attr)), 0);
+
+	EXPECT_LT(sys_mount_setattr(-1, "mnt/A", AT_RECURSIVE, &invalid_attr,
+				    sizeof(invalid_attr)), 0);
+}
+
+TEST_F(mount_setattr, basic)
+{
+	unsigned int old_flags = 0, new_flags = 0, expected_flags = 0;
+	struct mount_attr attr = {
+		.attr_set	= MOUNT_ATTR_RDONLY | MOUNT_ATTR_NOEXEC,
+		.atime		= MAKE_ATIME_RELATIVE,
+	};
+
+	old_flags = read_mnt_flags("/mnt/A");
+	EXPECT_GT(old_flags, 0);
+
+	EXPECT_EQ(sys_mount_setattr(-1, "/mnt/A", 0, &attr, sizeof(attr)), 0);
+
+	expected_flags = old_flags;
+	expected_flags |= MS_RDONLY;
+	expected_flags |= MS_NOEXEC;
+	expected_flags &= ~MS_NOATIME;
+	expected_flags |= MS_RELATIME;
+
+	new_flags = read_mnt_flags("/mnt/A");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	new_flags = read_mnt_flags("/mnt/A/AA");
+	ASSERT_EQ(new_flags, old_flags);
+
+	new_flags = read_mnt_flags("/mnt/A/AA/B");
+	ASSERT_EQ(new_flags, old_flags);
+
+	new_flags = read_mnt_flags("/mnt/A/AA/B/BB");
+	ASSERT_EQ(new_flags, old_flags);
+}
+
+TEST_F(mount_setattr, basic_recursive)
+{
+	int fd;
+	unsigned int old_flags = 0, new_flags = 0, expected_flags = 0;
+	struct mount_attr attr = {
+		.attr_set	= MOUNT_ATTR_RDONLY | MOUNT_ATTR_NOEXEC,
+		.atime		= MAKE_ATIME_RELATIVE,
+	};
+
+	old_flags = read_mnt_flags("/mnt/A");
+	EXPECT_GT(old_flags, 0);
+
+	EXPECT_EQ(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &attr, sizeof(attr)), 0);
+
+	expected_flags = old_flags;
+	expected_flags |= MS_RDONLY;
+	expected_flags |= MS_NOEXEC;
+	expected_flags &= ~MS_NOATIME;
+	expected_flags |= MS_RELATIME;
+
+	new_flags = read_mnt_flags("/mnt/A");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	new_flags = read_mnt_flags("/mnt/A/AA");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	new_flags = read_mnt_flags("/mnt/A/AA/B");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	new_flags = read_mnt_flags("/mnt/A/AA/B/BB");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	memset(&attr, 0, sizeof(attr));
+	attr.attr_clr = MOUNT_ATTR_RDONLY;
+	attr.propagation = MAKE_PROPAGATION_SHARED;
+	EXPECT_EQ(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &attr, sizeof(attr)), 0);
+
+	expected_flags &= ~MS_RDONLY;
+	new_flags = read_mnt_flags("/mnt/A");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	EXPECT_EQ(is_shared_mount("/mnt/A"), true);
+
+	new_flags = read_mnt_flags("/mnt/A/AA");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	EXPECT_EQ(is_shared_mount("/mnt/A/AA"), true);
+
+	new_flags = read_mnt_flags("/mnt/A/AA/B");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	EXPECT_EQ(is_shared_mount("/mnt/A/AA/B"), true);
+
+	new_flags = read_mnt_flags("/mnt/A/AA/B/BB");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	EXPECT_EQ(is_shared_mount("/mnt/A/AA/B/BB"), true);
+
+	fd = open("/mnt/A/AA/B/b", O_RDWR | O_CLOEXEC | O_CREAT | O_EXCL, 0777);
+	ASSERT_GE(fd, 0);
+
+	/*
+	 * We're holding a fd open for writing so this needs to fail somewhere
+	 * in the middle and the mount options need to be unchanged.
+	 */
+	attr.attr_set = MOUNT_ATTR_RDONLY;
+	EXPECT_LT(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &attr, sizeof(attr)), 0);
+
+	new_flags = read_mnt_flags("/mnt/A");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	ASSERT_EQ(is_shared_mount("/mnt/A"), true);
+
+	new_flags = read_mnt_flags("/mnt/A/AA");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	ASSERT_EQ(is_shared_mount("/mnt/A/AA"), true);
+
+	new_flags = read_mnt_flags("/mnt/A/AA/B");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	ASSERT_EQ(is_shared_mount("/mnt/A/AA/B"), true);
+
+	new_flags = read_mnt_flags("/mnt/A/AA/B/BB");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	ASSERT_EQ(is_shared_mount("/mnt/A/AA/B/BB"), true);
+
+	EXPECT_EQ(close(fd), 0);
+}
+
+TEST_F(mount_setattr, mount_has_writers)
+{
+	int fd;
+	unsigned int old_flags = 0, new_flags = 0;
+	struct mount_attr attr = {
+		.attr_set	= MOUNT_ATTR_RDONLY | MOUNT_ATTR_NOEXEC,
+		.atime		= MAKE_ATIME_RELATIVE,
+		.propagation	= MAKE_PROPAGATION_SHARED,
+	};
+
+	old_flags = read_mnt_flags("/mnt/A");
+	EXPECT_GT(old_flags, 0);
+
+	fd = open("/mnt/A/AA/B/b", O_RDWR | O_CLOEXEC | O_CREAT | O_EXCL, 0777);
+	ASSERT_GE(fd, 0);
+
+	/*
+	 * We're holding a fd open to a mount somwhere in the middle so this
+	 * needs to fail somewhere in the middle. After this the mount options
+	 * need to be unchanged.
+	 */
+	EXPECT_LT(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &attr, sizeof(attr)), 0);
+
+	new_flags = read_mnt_flags("/mnt/A");
+	ASSERT_EQ(new_flags, old_flags);
+
+	ASSERT_EQ(is_shared_mount("/mnt/A"), false);
+
+	new_flags = read_mnt_flags("/mnt/A/AA");
+	ASSERT_EQ(new_flags, old_flags);
+
+	ASSERT_EQ(is_shared_mount("/mnt/A/AA"), false);
+
+	new_flags = read_mnt_flags("/mnt/A/AA/B");
+	ASSERT_EQ(new_flags, old_flags);
+
+	ASSERT_EQ(is_shared_mount("/mnt/A/AA/B"), false);
+
+	new_flags = read_mnt_flags("/mnt/A/AA/B/BB");
+	ASSERT_EQ(new_flags, old_flags);
+
+	ASSERT_EQ(is_shared_mount("/mnt/A/AA/B/BB"), false);
+
+	EXPECT_EQ(close(fd), 0);
+
+	/* All writers are gone so this should succeed. */
+	EXPECT_EQ(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &attr, sizeof(attr)), 0);
+}
+
+TEST_F(mount_setattr, mixed_mount_options)
+{
+	unsigned int old_flags1 = 0, old_flags2 = 0, new_flags = 0, expected_flags = 0;
+	struct mount_attr attr = {
+		.attr_clr	= MOUNT_ATTR_RDONLY | MOUNT_ATTR_NOSUID | MOUNT_ATTR_NOEXEC,
+		.atime		= MAKE_ATIME_RELATIVE,
+	};
+
+	old_flags1 = read_mnt_flags("/mnt/B");
+	EXPECT_GT(old_flags1, 0);
+
+	old_flags2 = read_mnt_flags("/mnt/B/BB");
+	EXPECT_GT(old_flags2, 0);
+
+	EXPECT_EQ(sys_mount_setattr(-1, "/mnt/B", AT_RECURSIVE, &attr, sizeof(attr)), 0);
+
+	expected_flags = old_flags2;
+	expected_flags &= ~(MS_RDONLY | MS_NOEXEC | MS_NOATIME | MS_NOSUID);
+	expected_flags |= MS_RELATIME;
+
+	new_flags = read_mnt_flags("/mnt/B");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	expected_flags = old_flags2;
+	expected_flags &= ~(MS_RDONLY | MS_NOEXEC | MS_NOATIME | MS_NOSUID);
+	expected_flags |= MS_RELATIME;
+
+	new_flags = read_mnt_flags("/mnt/B/BB");
+	ASSERT_EQ(new_flags, expected_flags);
+}
+
+TEST_F(mount_setattr, time_changes)
+{
+	unsigned int old_flags = 0, new_flags = 0, expected_flags = 0;
+	struct mount_attr attr = {
+		.attr_set	= MOUNT_ATTR_NODIRATIME,
+		.attr_clr	= 0,
+		.atime		= MAKE_ATIME_NONE,
+	};
+
+	old_flags = read_mnt_flags("/mnt/A");
+	EXPECT_GT(old_flags, 0);
+
+	EXPECT_EQ(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &attr, sizeof(attr)), 0);
+
+	expected_flags = old_flags;
+	expected_flags |= MS_NOATIME;
+	expected_flags |= MS_NODIRATIME;
+
+	new_flags = read_mnt_flags("/mnt/A");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	new_flags = read_mnt_flags("/mnt/A/AA");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	new_flags = read_mnt_flags("/mnt/A/AA/B");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	new_flags = read_mnt_flags("/mnt/A/AA/B/BB");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	memset(&attr, 0, sizeof(attr));
+	attr.atime = MAKE_ATIME_RELATIVE;
+	EXPECT_EQ(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &attr, sizeof(attr)), 0);
+
+	expected_flags &= ~MS_NOATIME;
+	expected_flags |= MS_RELATIME;
+
+	new_flags = read_mnt_flags("/mnt/A");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	new_flags = read_mnt_flags("/mnt/A/AA");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	new_flags = read_mnt_flags("/mnt/A/AA/B");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	new_flags = read_mnt_flags("/mnt/A/AA/B/BB");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	memset(&attr, 0, sizeof(attr));
+	attr.atime = MAKE_ATIME_STRICT;
+	EXPECT_EQ(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &attr, sizeof(attr)), 0);
+
+	expected_flags &= ~MS_RELATIME;
+
+	new_flags = read_mnt_flags("/mnt/A");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	new_flags = read_mnt_flags("/mnt/A/AA");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	new_flags = read_mnt_flags("/mnt/A/AA/B");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	new_flags = read_mnt_flags("/mnt/A/AA/B/BB");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	memset(&attr, 0, sizeof(attr));
+	attr.atime = MAKE_ATIME_NONE;
+	EXPECT_EQ(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &attr, sizeof(attr)), 0);
+
+	expected_flags |= MS_NOATIME;
+	new_flags = read_mnt_flags("/mnt/A");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	new_flags = read_mnt_flags("/mnt/A/AA");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	new_flags = read_mnt_flags("/mnt/A/AA/B");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	new_flags = read_mnt_flags("/mnt/A/AA/B/BB");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	memset(&attr, 0, sizeof(attr));
+	EXPECT_EQ(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &attr, sizeof(attr)), 0);
+
+	new_flags = read_mnt_flags("/mnt/A");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	new_flags = read_mnt_flags("/mnt/A/AA");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	new_flags = read_mnt_flags("/mnt/A/AA/B");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	new_flags = read_mnt_flags("/mnt/A/AA/B/BB");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	memset(&attr, 0, sizeof(attr));
+	attr.attr_clr = MOUNT_ATTR_NODIRATIME;
+	EXPECT_EQ(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &attr, sizeof(attr)), 0);
+
+	expected_flags &= ~MS_NODIRATIME;
+
+	new_flags = read_mnt_flags("/mnt/A");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	new_flags = read_mnt_flags("/mnt/A/AA");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	new_flags = read_mnt_flags("/mnt/A/AA/B");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	new_flags = read_mnt_flags("/mnt/A/AA/B/BB");
+	ASSERT_EQ(new_flags, expected_flags);
+}
+
+TEST_F(mount_setattr, multi_threaded)
+{
+	int i, j, nthreads;
+	unsigned int old_flags = 0, new_flags = 0, expected_flags = 0;
+	pthread_attr_t pattr;
+	pthread_t threads[DEFAULT_THREADS];
+
+	old_flags = read_mnt_flags("/mnt/A");
+	EXPECT_GT(old_flags, 0);
+
+	/* Try to change mount options from multiple threads. */
+	nthreads = get_nprocs_conf();
+	if (nthreads > DEFAULT_THREADS)
+		nthreads = DEFAULT_THREADS;
+
+	pthread_attr_init(&pattr);
+	for (i = 0; i < nthreads; i++)
+		EXPECT_EQ(pthread_create(&threads[i], &pattr, mount_setattr_thread, NULL), 0);
+
+	for (j = 0; j < i; j++) {
+		void *retptr = NULL;
+
+		EXPECT_EQ(pthread_join(threads[j], &retptr), 0);
+
+		ASSERT_EQ(ptr_to_int(retptr), 0);
+	}
+	pthread_attr_destroy(&pattr);
+
+	expected_flags = old_flags;
+	expected_flags |= MS_RDONLY;
+	expected_flags |= MS_NOSUID;
+	new_flags = read_mnt_flags("/mnt/A");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	ASSERT_EQ(is_shared_mount("/mnt/A"), true);
+
+	new_flags = read_mnt_flags("/mnt/A/AA");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	ASSERT_EQ(is_shared_mount("/mnt/A/AA"), true);
+
+	new_flags = read_mnt_flags("/mnt/A/AA/B");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	ASSERT_EQ(is_shared_mount("/mnt/A/AA/B"), true);
+
+	new_flags = read_mnt_flags("/mnt/A/AA/B/BB");
+	ASSERT_EQ(new_flags, expected_flags);
+
+	ASSERT_EQ(is_shared_mount("/mnt/A/AA/B/BB"), true);
+}
+
+TEST_F(mount_setattr, wrong_user_namespace)
+{
+	int ret;
+	struct mount_attr attr = {
+		.attr_set = MOUNT_ATTR_RDONLY,
+	};
+
+	EXPECT_EQ(create_and_enter_userns(), 0);
+	ret = sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &attr, sizeof(attr));
+	ASSERT_LT(ret, 0);
+	ASSERT_EQ(errno, EPERM);
+}
+
+TEST_F(mount_setattr, wrong_mount_namespace)
+{
+	int fd, ret;
+	struct mount_attr attr = {
+		.attr_set = MOUNT_ATTR_RDONLY,
+	};
+
+	fd = open("/mnt/A", O_DIRECTORY | O_CLOEXEC);
+	EXPECT_GE(fd, 0);
+
+	EXPECT_EQ(unshare(CLONE_NEWNS), 0);
+
+	ret = sys_mount_setattr(fd, "", AT_EMPTY_PATH | AT_RECURSIVE, &attr, sizeof(attr));
+	ASSERT_LT(ret, 0);
+	ASSERT_EQ(errno, EINVAL);
+}
+
+TEST_HARNESS_MAIN
-- 
2.27.0


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/4] namespace: take lock_mount_hash() directly when changing flags
  2020-07-14 16:14 ` [PATCH 1/4] namespace: take lock_mount_hash() directly when changing flags Christian Brauner
@ 2020-07-14 16:49   ` Jann Horn
  0 siblings, 0 replies; 13+ messages in thread
From: Jann Horn @ 2020-07-14 16:49 UTC (permalink / raw)
  To: Christian Brauner
  Cc: David Howells, Al Viro, linux-fsdevel, kernel list, Linux API,
	Michael Kerrisk

On Tue, Jul 14, 2020 at 6:16 PM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
> Changing mount options always ends up taking lock_mount_hash() but when
> MNT_READONLY is requested and neither the mount nor the superblock are
> not already MNT_READONLY we end up taking the lock, dropping it, and
> retaking it to change the other mount attributes. Instead of this,
> acquire the lock once when changing mount properties. This simplifies
> the locking in these codepath, makes them easier to reason about and
> avoids having to reacquire the lock right after dropping it.
[...]
> diff --git a/fs/namespace.c b/fs/namespace.c
[...]
> @@ -463,7 +463,6 @@ static int mnt_make_readonly(struct mount *mnt)
>  {
>         int ret = 0;
>
> -       lock_mount_hash();
>         mnt->mnt.mnt_flags |= MNT_WRITE_HOLD;
>         /*
>          * After storing MNT_WRITE_HOLD, we'll read the counters. This store
> @@ -497,15 +496,12 @@ static int mnt_make_readonly(struct mount *mnt)
>          */
>         smp_wmb();
>         mnt->mnt.mnt_flags &= ~MNT_WRITE_HOLD;
> -       unlock_mount_hash();
>         return ret;
>  }

It might be a good idea, instead of completely removing the locking
calls here, to replace them with lockdep_assert_held(...).
(Currently that doesn't appear much in core VFS code though.)

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 3/4] fs: add mount_setattr()
  2020-07-14 16:14 ` [PATCH 3/4] fs: add mount_setattr() Christian Brauner
@ 2020-07-15  8:29   ` Geert Uytterhoeven
  2020-07-15 11:02   ` Christian Brauner
  1 sibling, 0 replies; 13+ messages in thread
From: Geert Uytterhoeven @ 2020-07-15  8:29 UTC (permalink / raw)
  To: Christian Brauner
  Cc: David Howells, Al Viro, Linux FS Devel,
	Linux Kernel Mailing List, Linux API, Michael Kerrisk,
	Aleksa Sarai

On Tue, Jul 14, 2020 at 6:17 PM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
> This implements the mount_setattr() syscall. While the new mount api
> allows to change the properties of a superblock there is currently no
> way to change the mount properties of a mount or mount tree using mount
> file descriptors which the new mount api is based on. In addition the
> old mount api has the restriction that mount options cannot be
> applied recursively. This hasn't changed since changing mount options on
> a per-mount basis was implemented in [1] and has been a frequent
> request.

[...]

> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

>  arch/m68k/kernel/syscalls/syscall.tbl       |   1 +

Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>

Gr{oetje,eeting}s,

                        Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 3/4] fs: add mount_setattr()
  2020-07-14 16:14 ` [PATCH 3/4] fs: add mount_setattr() Christian Brauner
  2020-07-15  8:29   ` Geert Uytterhoeven
@ 2020-07-15 11:02   ` Christian Brauner
  1 sibling, 0 replies; 13+ messages in thread
From: Christian Brauner @ 2020-07-15 11:02 UTC (permalink / raw)
  To: David Howells, Al Viro, linux-fsdevel
  Cc: linux-kernel, linux-api, Michael Kerrisk, Aleksa Sarai

On Tue, Jul 14, 2020 at 06:14:15PM +0200, Christian Brauner wrote:
> This implements the mount_setattr() syscall. While the new mount api
> allows to change the properties of a superblock there is currently no
> way to change the mount properties of a mount or mount tree using mount
> file descriptors which the new mount api is based on. In addition the
> old mount api has the restriction that mount options cannot be
> applied recursively. This hasn't changed since changing mount options on
> a per-mount basis was implemented in [1] and has been a frequent
> request.
> The legacy mount is currently unable to accommodate this behavior
> without introducing a whole new set of flags because MS_REC | MS_REMOUNT
> | MS_BIND | MS_RDONLY | MS_NOEXEC | [...] only apply the mount option to
> the topmost mount. Changing MS_REC to apply to the whole mount tree
> would mean introducing a significant uapi change and would likely cause
> significant regressions.
> 
> The new mount_setattr() syscall allows to recursively clear and set
> mount options in one shot. Multiple calls to change mount options
> requesting the same changes are idempotent:
> 
> int mount_setattr(int dfd, const char *path, unsigned flags,
>                   struct mount_attr *uattr, size_t usize);
> 
> Flags to modify path resolution behavior are specified in the @flags
> argument. Currently, AT_EMPTY_PATH, AT_RECURSIVE, AT_SYMLINK_NOFOLLOW,
> and AT_NO_AUTOMOUNT are supported. If useful, additional lookup flags to
> restrict path resolution as introduced with openat2() might be supported
> in the future.
> 
> mount_setattr() can be expected to grow over time and is designed with
> extensibility in mind. It follows the extensible syscall pattern we have
> used with other syscalls such as openat2(), clone3(),
> sched_{set,get}attr(), and others.
> The set of mount options is passed in the uapi struct mount_attr which
> currently has the following layout:
> 
> struct mount_attr {
> 	__u64 attr_set;
> 	__u64 attr_clr;
> 	__u32 propagation;
> 	__u32 atime;
> 
> };
> 
> The @attr_set and @attr_clr members are used to clear and set mount
> options. This way a user can e.g. request that a set of flags is to be
> raised such as turning mounts readonly by raising MOUNT_ATTR_RDONLY in
> @attr_set while at the same time requesting that another set of flags is
> to be lowered such as removing noexec from a mount tree by specifying
> MOUNT_ATTR_NOEXEC in @attr_clr.
> 
> The @propagation field lets callers specify the propagation type of a
> mount tree. Propagation is a single property that has four different
> settings and as such is not really a flag argument but an enum.
> Specifically, it would be unclear what setting and clearing propagation
> settings in combination would amount to. The legacy mount() syscall thus
> forbids the combination of multiple propagation settings too. The goal
> is to keep the semantics of mount propagation somewhat simple as they
> are overly complex as it is.
> 
> Finally, struct mount_attr contains an @atime field which can be used to
> set the atime behavior of a mount tree. Currently, access times are
> already treated and defined like an enum in the new mount api so there's
> no reason to treat them equivalent to a flag argument. A new atime enum
> is introduced. The reason for not reusing the atime flags useable with
> fsmount() and defined in the new mount api is that the
> MOUNT_ATTR_RELATIME enum is defined as 0. This means, a user wanting to
> transition to relative atime cannot simply specify MOUNT_ATTR_RELATIME
> in @atime or @attr_set as this would mean not specifying any atime
> settings is equivalent to specifying relative atime. This would cause
> confusion for userspace as not specifying atime settings would switch
> them to relatime. The new set of enums rectifies this by starting the
> definition at 1 and letting 0 mean that atime settings are supposed to
> be left unchanged.
> 
> Changing mount option has quite a few moving parts and the locking is
> quite intricate so it is not unlikely that I got subtleties wrong.
> 
> [1]: commit 2e4b7fcd9260 ("[PATCH] r/o bind mounts: honor mount writer counts at remount")
> Cc: David Howells <dhowells@redhat.com>
> Cc: Aleksa Sarai <cyphar@cyphar.com>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: linux-api@vger.kernel.org
> Cc: linux-fsdevel@vger.kernel.org
> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> ---
>  arch/alpha/kernel/syscalls/syscall.tbl      |   1 +
>  arch/arm/tools/syscall.tbl                  |   1 +
>  arch/arm64/include/asm/unistd32.h           |   2 +
>  arch/ia64/kernel/syscalls/syscall.tbl       |   1 +
>  arch/m68k/kernel/syscalls/syscall.tbl       |   1 +
>  arch/microblaze/kernel/syscalls/syscall.tbl |   1 +
>  arch/mips/kernel/syscalls/syscall_n32.tbl   |   1 +
>  arch/mips/kernel/syscalls/syscall_n64.tbl   |   1 +
>  arch/mips/kernel/syscalls/syscall_o32.tbl   |   1 +
>  arch/parisc/kernel/syscalls/syscall.tbl     |   1 +
>  arch/powerpc/kernel/syscalls/syscall.tbl    |   1 +
>  arch/s390/kernel/syscalls/syscall.tbl       |   1 +
>  arch/sh/kernel/syscalls/syscall.tbl         |   1 +
>  arch/sparc/kernel/syscalls/syscall.tbl      |   1 +
>  arch/x86/entry/syscalls/syscall_32.tbl      |   1 +
>  arch/x86/entry/syscalls/syscall_64.tbl      |   1 +
>  arch/xtensa/kernel/syscalls/syscall.tbl     |   1 +
>  fs/internal.h                               |   7 +
>  fs/namespace.c                              | 275 ++++++++++++++++++--
>  include/linux/syscalls.h                    |   3 +
>  include/uapi/asm-generic/unistd.h           |   4 +-
>  include/uapi/linux/mount.h                  |  31 +++
>  22 files changed, 321 insertions(+), 17 deletions(-)
> 
> diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
> index 5ddd128d4b7a..6c5c0b7a1c9e 100644
> --- a/arch/alpha/kernel/syscalls/syscall.tbl
> +++ b/arch/alpha/kernel/syscalls/syscall.tbl
> @@ -478,3 +478,4 @@
>  547	common	openat2				sys_openat2
>  548	common	pidfd_getfd			sys_pidfd_getfd
>  549	common	faccessat2			sys_faccessat2
> +550	common	mount_setattr			sys_mount_setattr
> diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
> index d5cae5ffede0..10014c157e3f 100644
> --- a/arch/arm/tools/syscall.tbl
> +++ b/arch/arm/tools/syscall.tbl
> @@ -452,3 +452,4 @@
>  437	common	openat2				sys_openat2
>  438	common	pidfd_getfd			sys_pidfd_getfd
>  439	common	faccessat2			sys_faccessat2
> +440	common	mount_setattr			sys_mount_setattr
> diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
> index 6d95d0c8bf2f..7de6051fa380 100644
> --- a/arch/arm64/include/asm/unistd32.h
> +++ b/arch/arm64/include/asm/unistd32.h
> @@ -885,6 +885,8 @@ __SYSCALL(__NR_openat2, sys_openat2)
>  __SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
>  #define __NR_faccessat2 439
>  __SYSCALL(__NR_faccessat2, sys_faccessat2)
> +#define __NR_mount_setattr 440
> +__SYSCALL(__NR_mount_setattr, sys_mount_setattr)
>  
>  /*
>   * Please add new compat syscalls above this comment and update
> diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
> index 49e325b604b3..dd81c63f3970 100644
> --- a/arch/ia64/kernel/syscalls/syscall.tbl
> +++ b/arch/ia64/kernel/syscalls/syscall.tbl
> @@ -359,3 +359,4 @@
>  437	common	openat2				sys_openat2
>  438	common	pidfd_getfd			sys_pidfd_getfd
>  439	common	faccessat2			sys_faccessat2
> +440	common	mount_setattr			sys_mount_setattr
> diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
> index f71b1bbcc198..cb78cb4da7dd 100644
> --- a/arch/m68k/kernel/syscalls/syscall.tbl
> +++ b/arch/m68k/kernel/syscalls/syscall.tbl
> @@ -438,3 +438,4 @@
>  437	common	openat2				sys_openat2
>  438	common	pidfd_getfd			sys_pidfd_getfd
>  439	common	faccessat2			sys_faccessat2
> +440	common	mount_setattr			sys_mount_setattr
> diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
> index edacc4561f2b..71a5b24e2b67 100644
> --- a/arch/microblaze/kernel/syscalls/syscall.tbl
> +++ b/arch/microblaze/kernel/syscalls/syscall.tbl
> @@ -444,3 +444,4 @@
>  437	common	openat2				sys_openat2
>  438	common	pidfd_getfd			sys_pidfd_getfd
>  439	common	faccessat2			sys_faccessat2
> +440	common	mount_setattr			sys_mount_setattr
> diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
> index f777141f5256..9dcafeef6c07 100644
> --- a/arch/mips/kernel/syscalls/syscall_n32.tbl
> +++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
> @@ -377,3 +377,4 @@
>  437	n32	openat2				sys_openat2
>  438	n32	pidfd_getfd			sys_pidfd_getfd
>  439	n32	faccessat2			sys_faccessat2
> +440	n32	mount_setattr			sys_mount_setattr
> diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
> index da8c76394e17..5e51a29cc21f 100644
> --- a/arch/mips/kernel/syscalls/syscall_n64.tbl
> +++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
> @@ -353,3 +353,4 @@
>  437	n64	openat2				sys_openat2
>  438	n64	pidfd_getfd			sys_pidfd_getfd
>  439	n64	faccessat2			sys_faccessat2
> +440	n64	mount_setattr			sys_mount_setattr
> diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
> index 13280625d312..5b5fa22cca16 100644
> --- a/arch/mips/kernel/syscalls/syscall_o32.tbl
> +++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
> @@ -426,3 +426,4 @@
>  437	o32	openat2				sys_openat2
>  438	o32	pidfd_getfd			sys_pidfd_getfd
>  439	o32	faccessat2			sys_faccessat2
> +440	o32	mount_setattr			sys_mount_setattr
> diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
> index 5a758fa6ec52..e7fca7c8c407 100644
> --- a/arch/parisc/kernel/syscalls/syscall.tbl
> +++ b/arch/parisc/kernel/syscalls/syscall.tbl
> @@ -436,3 +436,4 @@
>  437	common	openat2				sys_openat2
>  438	common	pidfd_getfd			sys_pidfd_getfd
>  439	common	faccessat2			sys_faccessat2
> +440	common	mount_setattr			sys_mount_setattr
> diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
> index f833a3190822..cfb50e8c5d45 100644
> --- a/arch/powerpc/kernel/syscalls/syscall.tbl
> +++ b/arch/powerpc/kernel/syscalls/syscall.tbl
> @@ -528,3 +528,4 @@
>  437	common	openat2				sys_openat2
>  438	common	pidfd_getfd			sys_pidfd_getfd
>  439	common	faccessat2			sys_faccessat2
> +440	common	mount_setattr			sys_mount_setattr
> diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
> index bfdcb7633957..12081f161b30 100644
> --- a/arch/s390/kernel/syscalls/syscall.tbl
> +++ b/arch/s390/kernel/syscalls/syscall.tbl
> @@ -441,3 +441,4 @@
>  437  common	openat2			sys_openat2			sys_openat2
>  438  common	pidfd_getfd		sys_pidfd_getfd			sys_pidfd_getfd
>  439  common	faccessat2		sys_faccessat2			sys_faccessat2
> +440  common	mount_setattr		sys_mount_setattr		sys_mount_setattr
> diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
> index acc35daa1b79..d4ffc9846ceb 100644
> --- a/arch/sh/kernel/syscalls/syscall.tbl
> +++ b/arch/sh/kernel/syscalls/syscall.tbl
> @@ -441,3 +441,4 @@
>  437	common	openat2				sys_openat2
>  438	common	pidfd_getfd			sys_pidfd_getfd
>  439	common	faccessat2			sys_faccessat2
> +440	common	mount_setattr			sys_mount_setattr
> diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
> index 8004a276cb74..024f010ee63e 100644
> --- a/arch/sparc/kernel/syscalls/syscall.tbl
> +++ b/arch/sparc/kernel/syscalls/syscall.tbl
> @@ -484,3 +484,4 @@
>  437	common	openat2			sys_openat2
>  438	common	pidfd_getfd			sys_pidfd_getfd
>  439	common	faccessat2			sys_faccessat2
> +440	common	mount_setattr			sys_mount_setattr
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index d8f8a1a69ed1..a89034dd8bc3 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -443,3 +443,4 @@
>  437	i386	openat2			sys_openat2
>  438	i386	pidfd_getfd		sys_pidfd_getfd
>  439	i386	faccessat2		sys_faccessat2
> +440	i386	mount_setattr		sys_mount_setattr
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index 78847b32e137..c23771eeb8df 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -360,6 +360,7 @@
>  437	common	openat2			sys_openat2
>  438	common	pidfd_getfd		sys_pidfd_getfd
>  439	common	faccessat2		sys_faccessat2
> +440	common	mount_setattr		sys_mount_setattr
>  
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
> index 69d0d73876b3..df0dfb1611f4 100644
> --- a/arch/xtensa/kernel/syscalls/syscall.tbl
> +++ b/arch/xtensa/kernel/syscalls/syscall.tbl
> @@ -409,3 +409,4 @@
>  437	common	openat2				sys_openat2
>  438	common	pidfd_getfd			sys_pidfd_getfd
>  439	common	faccessat2			sys_faccessat2
> +440	common	mount_setattr			sys_mount_setattr
> diff --git a/fs/internal.h b/fs/internal.h
> index 9b863a7bd708..62f7526d7536 100644
> --- a/fs/internal.h
> +++ b/fs/internal.h
> @@ -75,6 +75,13 @@ int do_linkat(int olddfd, const char __user *oldname, int newdfd,
>  /*
>   * namespace.c
>   */
> +struct mount_kattr {
> +	unsigned int attr_set;
> +	unsigned int attr_clr;
> +	unsigned int propagation;
> +	unsigned int lookup_flags;
> +	bool recurse;
> +};
>  extern void *copy_mount_options(const void __user *);
>  extern char *copy_mount_string(const void __user *);
>  
> diff --git a/fs/namespace.c b/fs/namespace.c
> index ab025dd3be04..13e29fcc82ab 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -459,10 +459,8 @@ void mnt_drop_write_file(struct file *file)
>  }
>  EXPORT_SYMBOL(mnt_drop_write_file);
>  
> -static int mnt_make_readonly(struct mount *mnt)
> +static inline int mnt_hold_writers(struct mount *mnt)
>  {
> -	int ret = 0;
> -
>  	mnt->mnt.mnt_flags |= MNT_WRITE_HOLD;
>  	/*
>  	 * After storing MNT_WRITE_HOLD, we'll read the counters. This store
> @@ -487,15 +485,30 @@ static int mnt_make_readonly(struct mount *mnt)
>  	 * we're counting up here.
>  	 */
>  	if (mnt_get_writers(mnt) > 0)
> -		ret = -EBUSY;
> -	else
> -		mnt->mnt.mnt_flags |= MNT_READONLY;
> +		return -EBUSY;
> +
> +	return 0;
> +}
> +
> +static inline void mnt_unhold_writers(struct mount *mnt)
> +{
>  	/*
>  	 * MNT_READONLY must become visible before ~MNT_WRITE_HOLD, so writers
>  	 * that become unheld will see MNT_READONLY.
>  	 */
>  	smp_wmb();
>  	mnt->mnt.mnt_flags &= ~MNT_WRITE_HOLD;
> +}
> +
> +static int mnt_make_readonly(struct mount *mnt)
> +{
> +	int ret;
> +
> +	ret = mnt_hold_writers(mnt);
> +	if (ret)
> +		return ret;
> +	mnt->mnt.mnt_flags |= MNT_READONLY;
> +	mnt_unhold_writers(mnt);
>  	return ret;
>  }

Sorry, this has apparently been left from an old version. The helper for
the new mount api needs to clear MNT_WRITE_HOLD unconditionally of
course. I'll send an updated version:

+static int mnt_make_readonly(struct mount *mnt)
+{
+	int ret;
+
+	ret = mnt_hold_writers(mnt);
+	if (!ret)
+		mnt->mnt.mnt_flags |= MNT_READONLY;
+	mnt_unhold_writers(mnt);
 	return ret;
 }

Christian

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/4] fs: add mount_setattr()
  2020-07-14 16:14 [PATCH 0/4] fs: add mount_setattr() Christian Brauner
                   ` (4 preceding siblings ...)
  2020-07-14 16:14 ` [PATCH 4/4] tests: add mount_setattr() selftests Christian Brauner
@ 2020-07-19 17:10 ` Al Viro
  2020-07-19 17:55   ` Christian Brauner
  2020-10-01 14:17 ` Pavel Tikhomirov
  6 siblings, 1 reply; 13+ messages in thread
From: Al Viro @ 2020-07-19 17:10 UTC (permalink / raw)
  To: Christian Brauner
  Cc: David Howells, linux-fsdevel, linux-kernel, linux-api, Michael Kerrisk

On Tue, Jul 14, 2020 at 06:14:11PM +0200, Christian Brauner wrote:

> mount_setattr() can be expected to grow over time and is designed with
> extensibility in mind. It follows the extensible syscall pattern we have
> used with other syscalls such as openat2(), clone3(),
> sched_{set,get}attr(), and others.

I.e. it's a likely crap insertion vector; any patches around that thing
will require the same level of review as addition of a brand new syscall.
And they will be harder to spot - consider the likely subjects for such
patches and compare to open addition of a new syscall...

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/4] fs: add mount_setattr()
  2020-07-19 17:10 ` [PATCH 0/4] fs: add mount_setattr() Al Viro
@ 2020-07-19 17:55   ` Christian Brauner
  0 siblings, 0 replies; 13+ messages in thread
From: Christian Brauner @ 2020-07-19 17:55 UTC (permalink / raw)
  To: Al Viro
  Cc: David Howells, linux-fsdevel, linux-kernel, linux-api, Michael Kerrisk

On Sun, Jul 19, 2020 at 06:10:54PM +0100, Al Viro wrote:
> On Tue, Jul 14, 2020 at 06:14:11PM +0200, Christian Brauner wrote:
> 
> > mount_setattr() can be expected to grow over time and is designed with
> > extensibility in mind. It follows the extensible syscall pattern we have
> > used with other syscalls such as openat2(), clone3(),
> > sched_{set,get}attr(), and others.
> 
> I.e. it's a likely crap insertion vector; any patches around that thing
> will require the same level of review as addition of a brand new syscall.

Which is just how we should and hopefully treat any meaningful
extension, yes. Otherwise let's just never add a flag argument to any
syscall and only have dup()- and accept()-like syscalls.

> And they will be harder to spot - consider the likely subjects for such
> patches and compare to open addition of a new syscall...

In the new revision I have dropped the atime argument because David
already plumbed setting atime into fsmount() via flags and making
userspace jump through more hoops by adding more constants seems
pointless. So the additional arguments can be moved up because we're
below the 6 syscall args limit.

Though I really want to stress that while I see the worry it is less a
technial argument but one for our review process where we should treat
extensions to syscalls as strict as syscall additions. Which yes, very
much so.

Thanks!
Christian

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] mount_setattr.2: New manual page documenting the mount_setattr() system call
  2020-07-14 16:14 ` [PATCH] mount_setattr.2: New manual page documenting the mount_setattr() system call Christian Brauner
@ 2020-07-21  9:59   ` Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 13+ messages in thread
From: Michael Kerrisk (man-pages) @ 2020-07-21  9:59 UTC (permalink / raw)
  To: Christian Brauner, David Howells, Al Viro, linux-fsdevel
  Cc: mtk.manpages, linux-kernel, linux-api

Hello Christian,

Thanks for writing this manual page. A few comments below.

On 7/14/20 6:14 PM, Christian Brauner wrote:
> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> ---
>  man2/mount_setattr.2 | 296 +++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 296 insertions(+)
>  create mode 100644 man2/mount_setattr.2
> 
> diff --git a/man2/mount_setattr.2 b/man2/mount_setattr.2
> new file mode 100644
> index 000000000..aae10525e
> --- /dev/null
> +++ b/man2/mount_setattr.2
> @@ -0,0 +1,296 @@
> +.\" Copyright (c) 2020 by Christian Brauner <christian.brauner@ubuntu.com>
> +.\"
> +.\" %%%LICENSE_START(VERBATIM)
> +.\" Permission is granted to make and distribute verbatim copies of this
> +.\" manual provided the copyright notice and this permission notice are
> +.\" preserved on all copies.
> +.\"
> +.\" Permission is granted to copy and distribute modified versions of this
> +.\" manual under the conditions for verbatim copying, provided that the
> +.\" entire resulting derived work is distributed under the terms of a
> +.\" permission notice identical to this one.
> +.\"
> +.\" Since the Linux kernel and libraries are constantly changing, this
> +.\" manual page may be incorrect or out-of-date.  The author(s) assume no
> +.\" responsibility for errors or omissions, or for damages resulting from
> +.\" the use of the information contained herein.  The author(s) may not
> +.\" have taken the same level of care in the production of this manual,
> +.\" which is licensed free of charge, as they might when working
> +.\" professionally.
> +.\"
> +.\" Formatted or processed versions of this manual, if unaccompanied by
> +.\" the source, must acknowledge the copyright and authors of this work.
> +.\" %%%LICENSE_END
> +.\"
> +.TH MOUNT_SETATTR 2 2020-07-14 "Linux" "Linux Programmer's Manual"
> +.SH NAME
> +mount_setattr \- change mount options of a mount or mount tree
> +.SH SYNOPSIS
> +.nf
> +.BI "int mount_setattr(int " dfd ", const char *" path ", unsigned int " flags ,
> +.BI "                  struct mount_attr *" attr ", size_t " size );
> +.fi
> +.PP
> +.IR Note :
> +There is no glibc wrapper for this system call; see NOTES.
> +.SH DESCRIPTION
> +The
> +.BR mount_setattr ()
> +system call changes the mount properties of a mount or whole mount tree.

You jump over a point between this sentence and the next. I suggest:

s/of a mount or whole mount tree
 /a mount or a whole mount tree whose location is specified
  by the dirfd and path arguments/

Seem okay?

> +If
> +.I path
> +is a relative pathname, then it is interpreted relative to the directory
> +referred to by the file descriptor
> +.I dirfd
> +(or the current working directory of the calling process, if
> +.I dirfd
> +is the special value
> +.BR AT_FDCWD ).
> +If
> +.BR AT_EMPTY_PATH
> +is specified in
> +.I flags

s/$/,/

> +then the mount properties of the mount identified by
> +.I dirfd
> +are changed.
> +.PP
> +The
> +.I flags
> +argument can be used to alter the path resolution behavior. The supported
> +values are:
> +.TP
> +.in +4n
> +.B AT_EMPTY_PATH
> +.in +4n
> +The mount properties of the mount identified by
> +.I dfd
> +are changed.
> +.TP
> +.in +4n
> +.B AT_RECURSIVE
> +.in +4n
> +Change the mount properties of the whole mount tree.
> +.TP
> +.in +4n
> +.B AT_SYMLINK_NOFOLLOW
> +.in +4n
> +Don't follow trailing symlinks.
> +.TP
> +.in +4n
> +.B AT_NO_AUTOMOUNT
> +.in +4n
> +Don't trigger automounts.
> +.PP
> +The
> +.I attr
> +argument of
> +.BR mount_setattr ()
> +is a structure of the following form:
> +.PP
> +.in +4n
> +.EX
> +struct mount_attr {
> +    u64 attr_set;    /* Mount properties to set. */
> +    u64 attr_clr;    /* Mount properties to clear. */
> +    u32 propagation; /* Mount propagation type. */
> +    u32 atime;       /* Access time settings. */
> +};
> +.EE
> +.in
> +.PP
> +The
> +.I attr_set
> +and
> +.I attr_clr
> +members are used to specify the mount options that are supposed to be set or
> +cleared for a given mount or mount tree. The following mount attributes can be
> +specified in the
> +.I attr_set
> +and
> +.I attr_clear
> +fields:
> +.TP
> +.in +4n
> +.B MOUNT_ATTR_RDONLY
> +.in +4n
> +If set in
> +.I attr_set

s/$/,/

> +makes the mount read only and if set in

s/makes/make/
(and similar below)

> +.I attr_clr
> +removes the read only setting if set on the mount.

s/removes/remove/
(and similar below)

> +.TP
> +.in +4n
> +.B MOUNT_ATTR_NOSUID
> +.in +4n
> +If set in
> +.I attr_set

s/$/,/

> +makes the mount not honor set-user-ID and set-group-ID bits or file capabilities
> +when executing programs
> +and if set in
> +.I attr_clr
> +clears the set-user-ID, set-group-ID bits, file capability restriction if set on
> +this mount.
> +.TP
> +.in +4n
> +.B MOUNT_ATTR_NODEV
> +.in +4n
> +If set in
> +.I attr_set

s/$/,/

> +prevents access to devices on this mount

s/prevents/prevent/
(and similar below)

> +and if set in
> +.I attr_clr
> +removes the device access restriction if set on this mount.
> +.TP
> +.in +4n
> +.B MOUNT_ATTR_NOEXEC
> +.in +4n
> +If set in
> +.I attr_set
> +prevents executing programs on this mount
> +and if set in
> +.I attr_clr
> +removes the restriction to execute programs on this mount.
> +.TP
> +.in +4n
> +.B MOUNT_ATTR_NODIRATIME
> +.in +4n
> +If set in
> +.I attr_set
> +prevents updating access time for directories on this mount
> +and if set in
> +.I attr_clr
> +removes access time restriction for directories. Note that
> +.I MOUNT_ATTR_NODIRATIME

s/.I/.B/

> +can be combined with other access time settings and is implied
> +by the noatime setting. All other access time settins are mutually

s/settins/settings/

> +exclusive.
> +.PP
> +The
> +.I propagation
> +member is used to specify the propagation type of the mount or mount tree.
> +The supported mount propagation settings are:
> +.TP
> +.in +4n
> +.B MAKE_PROPAGATION_PRIVATE
> +.in +4n
> +Turn all mounts into private mounts. Mount and umount events do not propagate

Since it may be one or many mounts, I suggest:

s/Turn all mounts into private mounts
 /Make the mount(s) private/

And similar changes below.

And: s/umount/unmount/

> +into or out of this mount point.
> +.TP
> +.in +4n
> +.B MAKE_PROPAGATION_SHARED

These names seem rather verbose. Does the "MAKE_" prefix really add value?
What about using just "PROPAGATION_SHARED" and so on?

And then: is there a reason not to use the MS_ constants already used by
mount(2): MS_SHARED, MS_PRIVATE, etc. I appreciate that those constants
have "odd" values, dictated by the fact that the 'mountflags' argument
of mount(2) is used for many other bits. But, is there harm in simply
reusing the same values and names? (But see my note below before
answering.)

> +.in +4n
> +Turn all mounts into shared mounts. Mount points share events with members of a
> +peer group. Mount and unmount events immediately under this mount point
> +will propagate to the other mount points that are members of the peer group.
> +Propagation here means that the same mount or unmount will automatically occur
> +under all of the other mount points in the peer group. Conversely, mount and
> +unmount events that take place under peer mount points will propagate to this
> +mount point.
> +.TP
> +.in +4n
> +.B MAKE_PROPAGATION_DEPENDENT

Okay -- maybe this is the answer to my question above.

I applaud the move to change the problematic language. Is this name
("DEPENDENT") your invention and/or is there an intent to to propagate this
this language change into any existing parts of the API (e.g., MS_DEPENDENT
as a synonym for MS_SLAVE)? I think it might be useful to add a sentence 
along the lines of:

"This flag is the equivalent of the (unfortunately named) mount(2) 
MS_SLAVE flag."

Just so that the reader can see the equivalence.

> +.in +4n
> +Turn all mounts into dependent mounts. Mount and unmount events propagate into
> +this mount point from a shared  peer group. Mount and unmount events under this
> +mount point do not propagate to any peer.
> +.TP
> +.in +4n
> +.B MAKE_PROPAGATION_UNBINDABLE
> +.in +4n

Add: "Make the mount(s) unbindable."

> +This is like a private mount, and in addition this mount can't be bind mounted.
> +Attempts to bind mount this mount will fail.
> +When a recursive bind mount is performed on a directory subtree, any bind
> +mounts within the subtree are automatically pruned (i.e., not replicated) when
> +replicating that subtree to produce the target subtree.
> +.PP
> +The
> +.I atime
> +member is used to specify the access time behavior on a mount or mount tree.

Somewhere around here, I think it would be good to add a sentence or two to 
explain why the 'atime' field exists in addition to the 'attr_set' and
'attr_clr' fields.

That said, looking at another mail in this discussion, is 'atime'
going away?

> +The supported access times settings are:
> +.TP
> +.in +4n
> +.B MAKE_ATIME_RELATIVE
> +.in +4n
> +When a file on is accessed via this mount, update the file's last access time
> +(atime) only if the current value of atime is less than or equal to the file's
> +last modification time (mtime) or last status change time (ctime).
> +.TP
> +.in +4n
> +.B MAKE_ATIME_NONE
> +.in +4n
> +Do not update access times for (all types of) files on this mount.
> +.TP
> +.in +4n
> +.B MAKE_ATIME_STRICT
> +.in +4n
> +Always update the last access time (atime) when files are
> +accessed on this mount.
> +.PP
> +The
> +.I size
> +argument that is supplied to
> +.BR mount_setattr ()
> +should be initialized to the size of this structure.
> +(The existence of the
> +.I size
> +argument permits future extensions to the
> +.IR mount_attr
> +structure.)
> +.SH RETURN VALUE
> +On success,
> +.BR mount_setattr ()
> +zero is returned. On error, \-1 is returned and

Wording problem...

s/zero is returned/returns 0/

> +.I errno
> +is set to indicate the cause of the error.
> +.SH ERRORS
> +.TP
> +.B EBADF
> +.I dfd
> +is not a valid file descriptor.
> +.TP
> +.B ENOENT
> +A pathname was empty or had a nonexistent component.
> +.TP
> +.B EINVAL
> +Unsupported value in
> +.I flags
> +.TP
> +.B EINVAL
> +Unsupported value was specified in the
> +.I attr_set
> +field of
> +.IR mount_attr.

s/.IR mount_attr./
  .IR mount_attr ./

(and the same at various places below)

> +.TP
> +.B EINVAL
> +Unsupported value was specified in the
> +.I attr_clr
> +field of
> +.IR mount_attr.
> +.TP
> +.B EINVAL
> +Unsupported value was specified in the
> +.I propagation
> +field of
> +.IR mount_attr.
> +.TP
> +.B EINVAL
> +Unsupported value was specified in the
> +.I atime
> +field of
> +.IR mount_attr.
> +.TP
> +.B EINVAL
> +Caller tried to change the mount properties of a mount or mount tree
> +in another mount namespace.
> +.SH VERSIONS
> +.BR mount_setattr ()
> +first appeared in Linux ?.?.
> +.\" commit ?
> +.SH CONFORMING TO
> +.BR mount_setattr ()
> +is Linux specific.
> +.SH NOTES
> +Currently, there is no glibc wrapper for this system call; call it using
> +.BR syscall (2).
> +.SH SEE ALSO
> +.BR mount (2),

Add: mount_namespaces (7).

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/4] fs: add mount_setattr()
  2020-07-14 16:14 [PATCH 0/4] fs: add mount_setattr() Christian Brauner
                   ` (5 preceding siblings ...)
  2020-07-19 17:10 ` [PATCH 0/4] fs: add mount_setattr() Al Viro
@ 2020-10-01 14:17 ` Pavel Tikhomirov
  6 siblings, 0 replies; 13+ messages in thread
From: Pavel Tikhomirov @ 2020-10-01 14:17 UTC (permalink / raw)
  To: Christian Brauner, David Howells, Al Viro, linux-fsdevel
  Cc: linux-kernel, linux-api, Michael Kerrisk, Andrew Vagin


> 
> mount_setattr() can be expected to grow over time and is designed with
> extensibility in mind. It follows the extensible syscall pattern we have
> used with other syscalls such as openat2(), clone3(),
> sched_{set,get}attr(), and others.
> The set of mount options is passed in the uapi struct mount_attr which
> currently has the following layout:
> 
> struct mount_attr {
> 	__u64 attr_set;
> 	__u64 attr_clr;
> 	__u32 propagation;
> 	__u32 atime;
> };
> 

We probably can rework "mnt: allow to add a mount into an existing 
group" (MS_SET_GROUP https://lkml.org/lkml/2017/1/23/712) to an 
extension mount_setattr. Do anyone have any objections?

We need it in CRIU because it is a big problem to restore complex mount 
trees with complex propagation flags (see my LPC talk for details 
https://linuxplumbersconf.org/event/7/contributions/640/) of 
system-containers.

If we allow set(copy) sharing options we can separate mount tree restore 
and propagation restore and everything becomes much simpler. (And we 
already have CRIU implementation based on it, which helps with variety 
of bugs with mounts we previously had. 
https://src.openvz.org/projects/OVZ/repos/criu/browse/criu/mount-v2.c#880)

I've also tried to consider another approach 
https://github.com/Snorch/linux/commit/84886f588527b062993ec3e9760c879163852518 
to disable actual propagation while restoring the mount tree. But with 
this approach it looks like it would be still hard to restore in CRIU 
because: With "MS_SET_GROUP" we don't care about roots. Imagine we have 
mount A (shared_id=1, root="/some/sub/path") and mount B (master_id=1, 
root="/some/other/sub/path", with MS_SET_GROUP we can copy sharing from 
mount A to B and make B MS_SLAVE to restore them. In case we want to do 
the same with only inheritance we would have to have some helper mount 
in this share C (shared_id=1, root="/") so that B can inherit this share...

-- 
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2020-10-01 14:18 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-14 16:14 [PATCH 0/4] fs: add mount_setattr() Christian Brauner
2020-07-14 16:14 ` [PATCH] mount_setattr.2: New manual page documenting the mount_setattr() system call Christian Brauner
2020-07-21  9:59   ` Michael Kerrisk (man-pages)
2020-07-14 16:14 ` [PATCH 1/4] namespace: take lock_mount_hash() directly when changing flags Christian Brauner
2020-07-14 16:49   ` Jann Horn
2020-07-14 16:14 ` [PATCH 2/4] namespace: only take read lock in do_reconfigure_mnt() Christian Brauner
2020-07-14 16:14 ` [PATCH 3/4] fs: add mount_setattr() Christian Brauner
2020-07-15  8:29   ` Geert Uytterhoeven
2020-07-15 11:02   ` Christian Brauner
2020-07-14 16:14 ` [PATCH 4/4] tests: add mount_setattr() selftests Christian Brauner
2020-07-19 17:10 ` [PATCH 0/4] fs: add mount_setattr() Al Viro
2020-07-19 17:55   ` Christian Brauner
2020-10-01 14:17 ` Pavel Tikhomirov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).