All of lore.kernel.org
 help / color / mirror / Atom feed
* Questions re the new mount_setattr(2) manual page
@ 2021-08-10  1:38 Michael Kerrisk (man-pages)
  2021-08-10  7:12 ` Michael Kerrisk (man-pages)
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Michael Kerrisk (man-pages) @ 2021-08-10  1:38 UTC (permalink / raw)
  To: Christian Brauner
  Cc: mtk.manpages, Alejandro Colomar, linux-fsdevel, lkml, linux-man,
	Christoph Hellwig

Hi Christian,

Thanks for the very nice manual page that you wrote. I have
made a large number of (mostly trivial) edits. If you could
read the page closely, to check that I introduced no errors,
I would appreciate it.

I have various questions below, marked ???. Could you please take
a look at these, and I will then make further edits based on your
answers.

The current version of the page is already pushed to the man-pages
Git repo.

>   MOUNT_SETATTR(2)      Linux Programmer's Manual     MOUNT_SETATTR(2)
>
>   NAME
>       mount_setattr - change mount properties of a mount or mount

???
s/mount properties/properties ?

(Just bcause more concise.)

>       tree
>
>   SYNOPSIS
>       #include <linux/fcntl.h> /* Definition of AT_* constants */
>       #include <linux/mount.h> /* Definition of MOUNT_ATTR_* constants */
>       #include <sys/syscall.h> /* Definition of SYS_* constants */
>       #include <unistd.h>
>
>       int syscall(SYS_mount_setattr, int dirfd, const char *path,
>               unsigned int flags, struct mount_attr *attr, size_t size);
>
>       Note: glibc provides no wrapper for mount_setattr(),
>       necessitating the use of syscall(2).
>
>   DESCRIPTION
>       The mount_setattr() system call changes the mount properties
>       of a mount or an entire mount tree.  If path is a relative
>       pathname, then it is interpreted relative to the directory
>       referred to by the file descriptor dirfd.  If dirfd is the
>       special value AT_FDCWD, then path is interpreted relative to
>       the current working directory of the calling process.  If
>       path is the empty string and AT_EMPTY_PATH is specified in
>       flags, then the mount properties of the mount identified by
>       dirfd are changed.
>
>       The mount_setattr() system call uses an extensible structure
>       (struct mount_attr) to allow for future extensions.  Any non-
>       flag extensions to mount_setattr() will be implemented as new
>       fields appended to the this structure, with a zero value in a
>       new field resulting in the kernel behaving as though that
>       extension field was not present.  Therefore, the caller must
>       zero-fill this structure on initialization.  See the
>       "Extensibility" subsection under NOTES for more details.
>
>       The size argument should usually be specified as
>       sizeof(struct mount_attr).  However, if the caller does not
>       intend to make use of features that got introduced after the
>       initial version of struct mount_attr, it is possible to pass
>       the size of the initial struct together with the larger
>       struct.  This allows the kernel to not copy later parts of
>       the struct that aren't used anyway.  With each extension that
>       changes the size of struct mount_attr, the kernel will expose
>       a definition of the form MOUNT_ATTR_SIZE_VERnumber.  For
>       example, the macro for the size of the initial version of
>       struct mount_attr is MOUNT_ATTR_SIZE_VER0.

???
I think I understand the above paragraph, but I wonder if it could
be improved a little. The general principle is that one can always
pass the size of an earlier, smaller structure to the kernel, right?
My point is that it need not be the size of the initial structure,
right? So, I wonder whether a little rewording might be need above.
What do you think?

>
>       The flags argument can be used to alter the path resolution
>       behavior.  The supported values are:
>
>       AT_EMPTY_PATH
>              If path is the empty string, change the mount
>              properties on dirfd itself.
>
>       AT_RECURSIVE
>              Change the mount properties of the entire mount tree.
>
>       AT_SYMLINK_NOFOLLOW
>              Don't follow trailing symbolic links.
>
>       AT_NO_AUTOMOUNT
>              Don't trigger automounts.
>
>       The attr argument of mount_setattr() is a structure of the
>       following form:
>
>           struct mount_attr {
>               __u64 attr_set;     /* Mount properties to set */
>               __u64 attr_clr;     /* Mount properties to clear */
>               __u64 propagation;  /* Mount propagation type */
>               __u64 userns_fd;    /* User namespace file descriptor */
>           };
>
>       The attr_set and attr_clr members are used to specify the
>       mount properties that are supposed to be set or cleared for a
>       mount or mount tree.  Flags set in attr_set enable a property
>       on a mount or mount tree, and flags set in attr_clr remove a
>       property from a mount or mount tree.
>
>       When changing mount properties, the kernel will first clear
>       the flags specified in the attr_clr field, and then set the
>       flags specified in the attr_set field:

???
I find the following example a bit confusing. See below.

>
>           struct mount_attr attr = {
>               .attr_clr = MOUNT_ATTR_NOEXEC | MOUNT_ATTR_NODEV,
>               .attr_set = MOUNT_ATTR_RDONLY | MOUNT_ATTR_NOSUID,
>           };

???
I *think* that what you are trying to show is that the above initializer
resuts in the equivalent of the following code. Is that correct? If so, 
I think the text needs some work to make this clearer. Let me know.

>           unsigned int current_mnt_flags = mnt->mnt_flags;
>
>           /*
>            * Clear all flags set in .attr_clr,
>            * clearing MOUNT_ATTR_NOEXEC and MOUNT_ATTR_NODEV.
>            */
>           current_mnt_flags &= ~attr->attr_clr;
>
>           /*
>            * Now set all flags set in .attr_set,
>            * applying MOUNT_ATTR_RDONLY and MOUNT_ATTR_NOSUID.
>            */
>           current_mnt_flags |= attr->attr_set;
>
>           mnt->mnt_flags = current_mnt_flags;
>
>       As a rsult of this change, the mount or mount tree (a) is
>       read-only; (b) blocks the execution of set-user-ID and set-
>       group-ID programs; (c) allows execution of programs; and (d)
>       allows access to devices.
>
>       Multiple changes with the same set of flags requested in
>       attr_clr and attr_set are guaranteed to be idempotent after
>       the changes have been applied.
>
>       The following mount attributes can be specified in the
>       attr_set or attr_clr fields:
>
>       MOUNT_ATTR_RDONLY
>              If set in attr_set, makes the mount read-only.  If set
>              in attr_clr, removes the read-only setting if set on
>              the mount.
>
>       MOUNT_ATTR_NOSUID
>              If set in attr_set, causes the mount not to honor the
>              set-user-ID and set-group-ID mode bits and file
>              capabilities when executing programs.  If set in
>              attr_clr, clears the set-user-ID, set-group-ID, and
>              file capability restriction if set on this mount.
>
>       MOUNT_ATTR_NODEV
>              If set in attr_set, prevents access to devices on this
>              mount.  If set in attr_clr, removes the restriction
>              that prevented accessing devices on this mount.
>
>       MOUNT_ATTR_NOEXEC
>              If set in attr_set, prevents executing programs on
>              this mount.  If set in attr_clr, removes the
>              restriction that prevented executing programs on this
>              mount.
>
>       MOUNT_ATTR_NOSYMFOLLOW
>              If set in attr_set, prevents following symbolic links
>              on this mount.  If set in attr_clr, removes the
>              restriction that prevented following symbolic links on
>              this mount.
>
>       MOUNT_ATTR_NODIRATIME
>              If set in attr_set, prevents updating access time for
>              directories on this mount.  If set in attr_clr,
>              removes the restriction that prevented updating access
>              time for directories.  Note that MOUNT_ATTR_NODIRATIME
>              can be combined with other access-time settings and is
>              implied by the noatime setting.  All other access-time
>              settings are mutually exclusive.
>
>       MOUNT_ATTR__ATIME - changing access-time settings
>              In the new mount API, the access-time values are an
>              enum starting from 0.  Even though they are an enum
>              (in contrast to the other mount flags such as
>              MOUNT_ATTR_NOEXEC), they are nonetheless passed in
>              attr_set and attr_clr for consistency with fsmount(2),
>              which introduced this behavior.
>
>              Note that, since access times are an enum not a bit
>              map, users wanting to transition to a different
>              access-time setting cannot simply specify the access-
>              time setting in attr_set but must also set
>              MOUNT_ATTR__ATIME in the attr_clr field.  The kernel
>              will verify that MOUNT_ATTR__ATIME isn't partially set
>              in attr_clr, and that attr_set doesn't have any
>              access-time bits set if MOUNT_ATTR__ATIME isn't set in
>              attr_clr.
>
>              MOUNT_ATTR_RELATIME
>                     When a file is accessed via this mount, update
>                     the file's last access time (atime) only if the
>                     current value of atime is less than or equal to
>                     the file's last modification time (mtime) or
>                     last status change time (ctime).
>
>                     To enable this access-time setting on a mount
>                     or mount tree, MOUNT_ATTR_RELATIME must be set
>                     in attr_set and MOUNT_ATTR__ATIME must be set
>                     in the attr_clr field.
>
>              MOUNT_ATTR_NOATIME
>                     Do not update access times for (all types of)
>                     files on this mount.
>
>                     To enable this access-time setting on a mount
>                     or mount tree, MOUNT_ATTR_NOATIME must be set
>                     in attr_set and MOUNT_ATTR__ATIME must be set
>                     in the attr_clr field.
>
>              MOUNT_ATTR_STRICTATIME
>                     Always update the last access time (atime) when
>                     files are accessed on this mount.
>
>                     To enable this access-time setting on a mount
>                     or mount tree, MOUNT_ATTR_STRICTATIME must be
>                     set in attr_set and MOUNT_ATTR__ATIME must be
>                     set in the attr_clr field.
>
>       MOUNT_ATTR_IDMAP
>              If set in attr_set, creates an ID-mapped mount.  The
>              ID mapping is taken from the user namespace specified

In various places, you wrote "idmapping". "idmapped", etc. I've
changed these to the more natural English "ID mapping" etc.

>              in userns_fd and attached to the mount.
>
>              Since it is not supported to change the ID mapping of
>              a mount after it has been ID mapped, it is invalid to
>              specify MOUNT_ATTR_IDMAP in attr_clr.
>
>              For further details, see the subsection "ID-mapped
>              mounts" under NOTES.
>
>       The propagation field is used to specify the propagation type
>       of the mount or mount tree.  Mount propagation options are
>       mutually exclusive; that is, the propagation values behave
>       like an enum.  The supported mount propagation types are:
>
>       MS_PRIVATE
>              Turn all mounts into private mounts.  Mount and
>              unmount events do not propagate into or out of this
>              mount point.
>
>       MS_SHARED
>              Turn all mounts into shared mounts.  Mount points
>              share events with members of a peer group.  Mount and
>              unmount events immediately under this mount point will
>              propagate to the other mount points that are members
>              of the peer group.  Propagation here means that the
>              same mount or unmount will automatically occur under
>              all of the other mount points in the peer group.
>              Conversely, mount and unmount events that take place
>              under peer mount points will propagate to this mount
>              point.
>
>       MS_SLAVE
>              Turn all mounts into dependent mounts.  Mount and
>              unmount events propagate into this mount point from a
>              shared peer group.  Mount and unmount events under
>              this mount point do not propagate to any peer.
>
>       MS_UNBINDABLE
>              This is like a private mount, and in addition this
>              mount can't be bind mounted.  Attempts to bind mount
>              this mount will fail.  When a recursive bind mount is
>              performed on a directory subtree, any bind mounts
>              within the subtree are automatically pruned (i.e., not
>              replicated) when replicating that subtree to produce
>              the target subtree.
>
>       For further details on propagation types, see
>       mount_namespaces(7).
>
>   RETURN VALUE
>       On success, mount_setattr() returns zero.  On error, -1 is
>       returned and errno is set to indicate the cause of the error.
>
>   ERRORS
>       EBADF  dirfd is not a valid file descriptor.
>
>       EBADF  userns_fd is not a valid file descriptor.
>
>       EBUSY  The caller tried to change the mount to
>              MOUNT_ATTR_RDONLY, but the mount still holds files
>              open for writing.
>
>       EINVAL The path specified via the dirfd and path arguments to
>              mount_setattr() isn't a mount point.
>
>       EINVAL An unsupported value was set in flags.
>
>       EINVAL An unsupported value was specified in the attr_set
>              field of mount_attr.
>
>       EINVAL An unsupported value was specified in the attr_clr
>              field of mount_attr.
>
>       EINVAL An unsupported value was specified in the propagation
>              field of mount_attr.
>
>       EINVAL More than one of MS_SHARED, MS_SLAVE, MS_PRIVATE, or
>              MS_UNBINDABLE was set in the the propagation field of
>              mount_attr.
>
>       EINVAL An access-time setting was specified in the attr_set
>              field without MOUNT_ATTR__ATIME being set in the
>              attr_clr field.
>
>       EINVAL MOUNT_ATTR_IDMAP was specified in attr_clr.
>
>       EINVAL A file descriptor value was specified in userns_fd
>              which exceeds INT_MAX.
>
>       EINVAL A valid file descriptor value was specified in
>              userns_fd, but the file descriptor wasn't a namespace
>              file descriptor or did not refer to a user namespace.

???
Could the above not be simplified to

      EINVAL A valid file descriptor value was specified in
             userns_fd, but the file descriptor did not refer
             to a user namespace.
?

>
>       EINVAL The underlying filesystem does not support ID-mapped
>              mounts.
>
>       EINVAL The mount that is to be ID mapped is not a
>              detached/anonymous mount; that is, the mount is

???
What is a the distinction between "detached" and "anonymous"?
Or do you mean them to be synonymous? If so, then let's use
just one term, and I think "detached" is preferable.

>              already visible in the filesystem.
>
>       EINVAL A partial access-time setting was specified in
>              attr_clr instead of MOUNT_ATTR__ATIME being set.
>
>       EINVAL The mount is located outside the caller's mount
>              namespace.
>
>       EINVAL The underlying filesystem is mounted in a user
>              namespace.
>
>       ENOENT A pathname was empty or had a nonexistent component.
>
>       ENOMEM When changing mount propagation to MS_SHARED, a new
>              peer group ID needs to be allocated for all mounts
>              without a peer group ID set.  Allocation of this peer
>              group ID has failed.
>
>       ENOSPC When changing mount propagation to MS_SHARED, a new
>              peer group ID needs to be allocated for all mounts
>              without a peer group ID set.  Allocation of this peer
>              group ID can fail.  Note that technically further
>              error codes are possible that are specific to the ID
>              allocation implementation used.
>
>       EPERM  One of the mounts had at least one of
>              MOUNT_ATTR_NOATIME, MOUNT_ATTR_NODEV,
>              MOUNT_ATTR_NODIRATIME, MOUNT_ATTR_NOEXEC,
>              MOUNT_ATTR_NOSUID, or MOUNT_ATTR_RDONLY set and the
>              flag is locked.  Mount attributes become locked on a
>              mount if:
>
>              •  A new mount or mount tree is created causing mount
>                 propagation across user namespaces.  The kernel
>                 will lock the aforementioned flags to protect these
>                 sensitive properties from being altered.
>
>              •  A new mount and user namespace pair is created.
>                 This happens for example when specifying
>                 CLONE_NEWUSER | CLONE_NEWNS in unshare(2),
>                 clone(2), or clone3(2).  The aforementioned flags
>                 become locked to protect user namespaces from
>                 altering sensitive mount properties.
>
>       EPERM  A valid file descriptor value was specified in
>              userns_fd, but the file descriptor refers to the
>              initial user namespace.
>
>       EPERM  An already ID-mapped mount was supposed to be ID
>              mapped.

???
Better:
    An attempt was made to add an ID mapping to a mount that is already
    ID mapped.
?

>
>       EPERM  The caller does not have CAP_SYS_ADMIN in the initial
>              user namespace.
>
>   VERSIONS
>       mount_setattr() first appeared in Linux 5.12.
>
>   CONFORMING TO
>       mount_setattr() is Linux-specific.
>
>   NOTES
>   ID-mapped mounts
>       Creating an ID-mapped mount makes it possible to change the
>       ownership of all files located under a mount.  Thus, ID-
>       mapped mounts make it possible to change ownership in a
>       temporary and localized way.  It is a localized change
>       because ownership changes are restricted to a specific mount.

???
Would it be clearer to say something like:

    It is a localized change because ownership changes are
    visible only via a specific mount.
?


>       All other users and locations where the filesystem is exposed
>       are unaffected.  And it is a temporary change because
>       ownership changes are tied to the lifetime of the mount.
>
>       Whenever callers interact with the filesystem through an ID-
>       mapped mount, the ID mapping of the mount will be applied to
>       user and group IDs associated with filesystem objects.  This
>       encompasses the user and group IDs associated with inodes and
>       also the following xattr(7) keys:
>
>       •  security.capability, whenever filesystem capabilities are
>          stored or returned in the VFS_CAP_REVISION_3 format, which
>          stores a root user ID alongside the capabilities (see
>          capabilities(7)).
>
>       •  system.posix_acl_access and system.posix_acl_default,
>          whenever user IDs or group IDs are stored in ACL_USER or
>          ACL_GROUP entries.
>
>       The following conditions must be met in order to create an
>       ID-mapped mount:
>
>       •  The caller must have the CAP_SYS_ADMIN capability in the
>          initial user namespace.
>
>       •  The filesystem must be mounted in the initial user
>          namespace.

???
Should this rather be written as:
 
     The filesystem must be mounted in a mount namespace 
     that is owned by the initial user namespace.

>       •  The underlying filesystem must support ID-mapped mounts.
>          Currently, the xfs(5), ext4(5), and FAT filesystems
>          support ID-mapped mounts with more filesystems being
>          actively worked on.
>
>       •  The mount must not already be ID-mapped.  This also
>          implies that the ID mapping of a mount cannot be altered.
>
>       •  The mount must be a detached/anonymous mount; that is, it

???
See the above questionon "detached" vs "anonymous"

>          must have been created by calling open_tree(2) with the
>          OPEN_TREE_CLONE flag and it must not already have been
>          visible in the filesystem.
>
>       ID mappings can be created for user IDs, group IDs, and
>       project IDs.  An ID mapping is essentially a mapping of a
>       range of user or group IDs into another or the same range of
>       user or group IDs.  ID mappings are usually written as three
>       numbers either separated by white space or a full stop.  The
>       first two numbers specify the starting user or group ID in
>       each of the two user namespaces.  The third number specifies
>       the range of the ID mapping.  For example, a mapping for user
>       IDs such as 1000:1001:1 would indicate that user ID 1000 in
>       the caller's user namespace is mapped to user ID 1001 in its
>       ancestor user namespace.  Since the map range is 1, only user
>       ID 1000 is mapped.

???
The details above seem wrong. When writing to map files, the
fields must be white-space separated, AFAIK. But above you mention
"full stops" and also show an example using colons (:). Those
both seem wrong and confusing. Am I missing something?

>       It is possible to specify up to 340 ID mappings for each ID
>       mapping type.  If any user IDs or group IDs are not mapped,
>       all files owned by that unmapped user or group ID will appear
>       as being owned by the overflow user ID or overflow group ID
>       respectively.
>
>       Further details and instructions for setting up ID mappings
>       can be found in the user_namespaces(7) man page.
>
>       In the common case, the user namespace passed in userns_fd
>       together with MOUNT_ATTR_IDMAP in attr_set to create an ID-
>       mapped mount will be the user namespace of a container.  In
>       other scenarios it will be a dedicated user namespace
>       associated with a user's login session as is the case for
>       portable home directories in systemd-homed.service(8)).  It
>       is also perfectly fine to create a dedicated user namespace
>       for the sake of ID mapping a mount.
>
>       ID-mapped mounts can be useful in the following and a variety
>       of other scenarios:
>
>       •  Sharing files between multiple users or multiple machines,

???
s/Sharing files/Sharing filesystems/ ?

>          especially in complex scenarios.  For example, ID-mapped
>          mounts are used to implement portable home directories in
>          systemd-homed.service(8), where they allow users to move
>          their home directory to an external storage device and use
>          it on multiple computers where they are assigned different
>          user IDs and group IDs.  This effectively makes it
>          possible to assign random user IDs and group IDs at login
>          time.
>
>       •  Sharing files from the host with unprivileged containers.

???
s/Sharing files/Sharing filesystems/ ?

>          This allows a user to avoid having to change ownership
>          permanently through chown(2).
>
>       •  ID mapping a container's root filesystem.  Users don't
>          need to change ownership permanently through chown(2).
>          Especially for large root filesystems, using chown(2) can
>          be prohibitively expensive.
>
>       •  Sharing files between containers with non-overlapping ID

???
s/Sharing files/Sharing filesystems/ ?

>          mappings.
>
>       •  Implementing discretionary access (DAC) permission
>          checking for filesystems lacking a concept of ownership.
>
>       •  Efficiently changing ownership on a per-mount basis.  In
>          contrast to chown(2), changing ownership of large sets of
>          files is instantaneous with ID-mapped mounts.  This is
>          especially useful when ownership of an entire root
>          filesystem of a virtual machine or container is to be
>          changed as mentioned above.  With ID-mapped mounts, a
>          single mount_setattr() system call will be sufficient to
>          change the ownership of all files.
>
>       •  Taking the current ownership into account.  ID mappings
>          specify precisely what a user or group ID is supposed to
>          be mapped to.  This contrasts with the chown(2) system
>          call which cannot by itself take the current ownership of
>          the files it changes into account.  It simply changes the
>          ownership to the specified user ID and group ID.
>
>       •  Locally and temporarily restricted ownership changes.  ID-
>          mapped mounts make it possible to change ownership
>          locally, restricting it to specific mounts, and

???
The referent of "it" in the preceding line is not clear.
Should it be "the ownership changes"? Or something else?

>          temporarily as the ownership changes only apply as long as
>          the mount exists.  By contrast, changing ownership via the
>          chown(2) system call changes the ownership globally and
>          permanently.
>
>   Extensibility
>       In order to allow for future extensibility, mount_setattr()
>       requires the user-space application to specify the size of
>       the mount_attr structure that it is passing.  By providing
>       this information, it is possible for mount_setattr() to
>       provide both forwards- and backwards-compatibility, with size
>       acting as an implicit version number.  (Because new extension
>       fields will always be appended, the structure size will
>       always increase.)  This extensibility design is very similar
>       to other system calls such as perf_setattr(2),
>       perf_event_open(2), clone3(2) and openat2(2).
>
>       Let usize be the size of the structure as specified by the
>       user-space application, and let ksize be the size of the
>       structure which the kernel supports, then there are three
>       cases to consider:
>
>       •  If ksize equals usize, then there is no version mismatch
>          and attr can be used verbatim.
>
>       •  If ksize is larger than usize, then there are some
>          extension fields that the kernel supports which the user-
>          space application is unaware of.  Because a zero value in
>          any added extension field signifies a no-op, the kernel
>          treats all of the extension fields not provided by the
>          user-space application as having zero values.  This
>          provides backwards-compatibility.
>
>       •  If ksize is smaller than usize, then there are some
>          extension fields which the user-space application is aware
>          of but which the kernel does not support.  Because any
>          extension field must have its zero values signify a no-op,
>          the kernel can safely ignore the unsupported extension
>          fields if they are all zero.  If any unsupported extension
>          fields are non-zero, then -1 is returned and errno is set
>          to E2BIG.  This provides forwards-compatibility.
>
>       Because the definition of struct mount_attr may change in the
>       future (with new fields being added when system headers are
>       updated), user-space applications should zero-fill struct
>       mount_attr to ensure that recompiling the program with new
>       headers will not result in spurious errors at runtime.  The
>       simplest way is to use a designated initializer:
>
>           struct mount_attr attr = {
>               .attr_set = MOUNT_ATTR_RDONLY,
>               .attr_clr = MOUNT_ATTR_NODEV
>           };
>
>       Alternatively, the structure can be zero-filled using
>       memset(3) or similar functions:
>
>           struct mount_attr attr;
>           memset(&attr, 0, sizeof(attr));
>           attr.attr_set = MOUNT_ATTR_RDONLY;
>           attr.attr_clr = MOUNT_ATTR_NODEV;
>
>       A user-space application that wishes to determine which
>       extensions the running kernel supports can do so by
>       conducting a binary search on size with a structure which has
>       every byte nonzero (to find the largest value which doesn't
>       produce an error of E2BIG).
>
>   EXAMPLES

???
Do you have a (preferably simple) example piece of code
somewhere for setting up an ID mapped mount?

>       /*
>        * This program allows the caller to create a new detached mount
>        * and set various properties on it.
>        */
>       #define _GNU_SOURCE
>       #include <errno.h>
>       #include <fcntl.h>
>       #include <getopt.h>
>       #include <linux/mount.h>
>       #include <linux/types.h>
>       #include <stdbool.h>
>       #include <stdio.h>
>       #include <stdlib.h>
>       #include <string.h>
>       #include <sys/syscall.h>
>       #include <unistd.h>
>
>       static inline int
>       mount_setattr(int dirfd, const char *path, unsigned int flags,
>                     struct mount_attr *attr, size_t size)
>       {
>           return syscall(SYS_mount_setattr, dirfd, path, flags,
>                          attr, size);
>       }
>
>       static inline int
>       open_tree(int dirfd, const char *filename, unsigned int flags)
>       {
>           return syscall(SYS_open_tree, dirfd, filename, flags);
>       }
>
>       static inline int
>       move_mount(int from_dirfd, const char *from_pathname,
>                  int to_dirfd, const char *to_pathname,
>                  unsigned int flags)
>       {
>           return syscall(SYS_move_mount, from_dirfd, from_pathname,
>                          to_dirfd, to_pathname, flags);
>       }
>
>       static const struct option longopts[] = {
>           {"map-mount",       required_argument,  NULL,  'a'},
>           {"recursive",       no_argument,        NULL,  'b'},
>           {"read-only",       no_argument,        NULL,  'c'},
>           {"block-setid",     no_argument,        NULL,  'd'},
>           {"block-devices",   no_argument,        NULL,  'e'},
>           {"block-exec",      no_argument,        NULL,  'f'},
>           {"no-access-time",  no_argument,        NULL,  'g'},
>           { NULL,             0,                  NULL,   0 },
>       };
>
>       #define exit_log(format, ...)  do           \
>       {                                           \
>           fprintf(stderr, format, ##__VA_ARGS__); \
>           exit(EXIT_FAILURE);                     \
>       } while (0)
>
>       int
>       main(int argc, char *argv[])
>       {
>           struct mount_attr *attr = &(struct mount_attr){};
>           int fd_userns = -EBADF;

???
Why this magic initializer here? Why not just "-1"?
Using -EBADF makes it look this is value specifically is
meaningful, although I don't think that's true.

>           bool recursive = false;
>           int index = 0;
>           int ret;
>
>           while ((ret = getopt_long_only(argc, argv, "",
>                                          longopts, &index)) != -1) {
>               switch (ret) {
>               case 'a':
>                   fd_userns = open(optarg, O_RDONLY | O_CLOEXEC);
>                   if (fd_userns == -1)
>                       exit_log("%m - Failed top open %s\n", optarg);
>                   break;
>               case 'b':
>                   recursive = true;
>                   break;
>               case 'c':
>                   attr->attr_set |= MOUNT_ATTR_RDONLY;
>                   break;
>               case 'd':
>                   attr->attr_set |= MOUNT_ATTR_NOSUID;
>                   break;
>               case 'e':
>                   attr->attr_set |= MOUNT_ATTR_NODEV;
>                   break;
>               case 'f':
>                   attr->attr_set |= MOUNT_ATTR_NOEXEC;
>                   break;
>               case 'g':
>                   attr->attr_set |= MOUNT_ATTR_NOATIME;
>                   attr->attr_clr |= MOUNT_ATTR__ATIME;
>                   break;
>               default:
>                   exit_log("Invalid argument specified");
>               }
>           }
>
>           if ((argc - optind) < 2)
>               exit_log("Missing source or target mount point\n");
>
>           const char *source = argv[optind];
>           const char *target = argv[optind + 1];
>
>           int fd_tree = open_tree(-EBADF, source,
>                        OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC |
>                        AT_EMPTY_PATH | (recursive ? AT_RECURSIVE : 0));

???
What is the significance of -EBADF here? As far as I can tell, it
is not meaningful to open_tree()?


>           if (fd_tree == -1)
>               exit_log("%m - Failed to open %s\n", source);
>
>           if (fd_userns >= 0) {
>               attr->attr_set  |= MOUNT_ATTR_IDMAP;
>               attr->userns_fd = fd_userns;
>           }
>
>           ret = mount_setattr(fd_tree, "",
>                       AT_EMPTY_PATH | (recursive ? AT_RECURSIVE : 0),
>                       attr, sizeof(struct mount_attr));
>           if (ret == -1)
>               exit_log("%m - Failed to change mount attributes\n");
>
>           close(fd_userns);
>
>           ret = move_mount(fd_tree, "", -EBADF, target,
>                            MOVE_MOUNT_F_EMPTY_PATH);

???
What is the significance of -EBADF here? As far as I can tell, it
is not meaningful to move_mount()?

>           if (ret == -1)
>               exit_log("%m - Failed to attach mount to %s\n", target);
>
>           close(fd_tree);
>
>           exit(EXIT_SUCCESS);
>       }
>
>   SEE ALSO
>       newuidmap(1), newgidmap(1), clone(2), mount(2), unshare(2),
>       proc(5), mount_namespaces(7), capabilities(7),
>       user_namespaces(7), xattr(7)

Thanks,

Michael

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Questions re the new mount_setattr(2) manual page
  2021-08-10  1:38 Questions re the new mount_setattr(2) manual page Michael Kerrisk (man-pages)
@ 2021-08-10  7:12 ` Michael Kerrisk (man-pages)
  2021-08-10 14:11   ` Christian Brauner
  2021-08-10 14:32 ` Christian Brauner
  2021-08-10 22:47 ` Michael Kerrisk (man-pages)
  2 siblings, 1 reply; 15+ messages in thread
From: Michael Kerrisk (man-pages) @ 2021-08-10  7:12 UTC (permalink / raw)
  To: Christian Brauner
  Cc: mtk.manpages, Alejandro Colomar, linux-fsdevel, lkml, linux-man,
	Christoph Hellwig

Hi Christian,

One more question...

>>       The propagation field is used to specify the propagation typ
>>       of the mount or mount tree.  Mount propagation options are
>>       mutually exclusive; that is, the propagation values behave
>>       like an enum.  The supported mount propagation types are:

The manual page text doesn't actually say it, but if the 'propagation'
field is 0, then this means leave the propagation type unchanged, 
right? This of course should be mentioned in the manual page.

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Questions re the new mount_setattr(2) manual page
  2021-08-10  7:12 ` Michael Kerrisk (man-pages)
@ 2021-08-10 14:11   ` Christian Brauner
  2021-08-10 19:30     ` Michael Kerrisk (man-pages)
  0 siblings, 1 reply; 15+ messages in thread
From: Christian Brauner @ 2021-08-10 14:11 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Alejandro Colomar, linux-fsdevel, lkml, linux-man, Christoph Hellwig

On Tue, Aug 10, 2021 at 09:12:14AM +0200, Michael Kerrisk (man-pages) wrote:
> Hi Christian,
> 
> One more question...
> 
> >>       The propagation field is used to specify the propagation typ
> >>       of the mount or mount tree.  Mount propagation options are
> >>       mutually exclusive; that is, the propagation values behave
> >>       like an enum.  The supported mount propagation types are:
> 
> The manual page text doesn't actually say it, but if the 'propagation'
> field is 0, then this means leave the propagation type unchanged, 
> right? This of course should be mentioned in the manual page.

Yes, if none of the documented values is set the propagation is unchanged.

Christian

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Questions re the new mount_setattr(2) manual page
  2021-08-10  1:38 Questions re the new mount_setattr(2) manual page Michael Kerrisk (man-pages)
  2021-08-10  7:12 ` Michael Kerrisk (man-pages)
@ 2021-08-10 14:32 ` Christian Brauner
  2021-08-10 21:06   ` Michael Kerrisk (man-pages)
  2021-08-10 22:47 ` Michael Kerrisk (man-pages)
  2 siblings, 1 reply; 15+ messages in thread
From: Christian Brauner @ 2021-08-10 14:32 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Alejandro Colomar, linux-fsdevel, lkml, linux-man, Christoph Hellwig

On Tue, Aug 10, 2021 at 03:38:00AM +0200, Michael Kerrisk (man-pages) wrote:
> Hi Christian,
> 
> Thanks for the very nice manual page that you wrote. I have

Thank you!

> made a large number of (mostly trivial) edits. If you could
> read the page closely, to check that I introduced no errors,
> I would appreciate it.

Happy to!

> 
> I have various questions below, marked ???. Could you please take
> a look at these, and I will then make further edits based on your
> answers.

I've answered all questions, I think. Feel free to just reformulate
where my suggestions weren't adequate. Since most things you ask about
are minor adaptions there's no need from my end for you to resend with
those reformulations. You can just make them directly. :) I'll peruse
the man-pages git repo anyway after you apply them and will send changes
if I spot issues.

Thank you for the review!
Christian

> 
> The current version of the page is already pushed to the man-pages
> Git repo.
> 
> >   MOUNT_SETATTR(2)      Linux Programmer's Manual     MOUNT_SETATTR(2)
> >
> >   NAME
> >       mount_setattr - change mount properties of a mount or mount
> 
> ???
> s/mount properties/properties ?
> 
> (Just bcause more concise.)

Sounds good.

> 
> >       tree
> >
> >   SYNOPSIS
> >       #include <linux/fcntl.h> /* Definition of AT_* constants */
> >       #include <linux/mount.h> /* Definition of MOUNT_ATTR_* constants */
> >       #include <sys/syscall.h> /* Definition of SYS_* constants */
> >       #include <unistd.h>
> >
> >       int syscall(SYS_mount_setattr, int dirfd, const char *path,
> >               unsigned int flags, struct mount_attr *attr, size_t size);
> >
> >       Note: glibc provides no wrapper for mount_setattr(),
> >       necessitating the use of syscall(2).
> >
> >   DESCRIPTION
> >       The mount_setattr() system call changes the mount properties
> >       of a mount or an entire mount tree.  If path is a relative
> >       pathname, then it is interpreted relative to the directory
> >       referred to by the file descriptor dirfd.  If dirfd is the
> >       special value AT_FDCWD, then path is interpreted relative to
> >       the current working directory of the calling process.  If
> >       path is the empty string and AT_EMPTY_PATH is specified in
> >       flags, then the mount properties of the mount identified by
> >       dirfd are changed.
> >
> >       The mount_setattr() system call uses an extensible structure
> >       (struct mount_attr) to allow for future extensions.  Any non-
> >       flag extensions to mount_setattr() will be implemented as new
> >       fields appended to the this structure, with a zero value in a
> >       new field resulting in the kernel behaving as though that
> >       extension field was not present.  Therefore, the caller must
> >       zero-fill this structure on initialization.  See the
> >       "Extensibility" subsection under NOTES for more details.
> >
> >       The size argument should usually be specified as
> >       sizeof(struct mount_attr).  However, if the caller does not
> >       intend to make use of features that got introduced after the
> >       initial version of struct mount_attr, it is possible to pass
> >       the size of the initial struct together with the larger
> >       struct.  This allows the kernel to not copy later parts of
> >       the struct that aren't used anyway.  With each extension that
> >       changes the size of struct mount_attr, the kernel will expose
> >       a definition of the form MOUNT_ATTR_SIZE_VERnumber.  For
> >       example, the macro for the size of the initial version of
> >       struct mount_attr is MOUNT_ATTR_SIZE_VER0.
> 
> ???
> I think I understand the above paragraph, but I wonder if it could
> be improved a little. The general principle is that one can always
> pass the size of an earlier, smaller structure to the kernel, right?

Yes.

> My point is that it need not be the size of the initial structure,
> right? So, I wonder whether a little rewording might be need above.

Yes, the initial structure size is just an example because that will be
very common.

> What do you think?

Sure, I'm proposing something here but please, fell free to reformulate
or come up with something completely new:

	[...]
	However, if the caller is using a kernel that supports an
	extended struct mount_attr but the caller does not intend to
	make use of these features they can pass the size of an earlier
	version of the struct together with the extended structure.
	[...]

> 
> >
> >       The flags argument can be used to alter the path resolution
> >       behavior.  The supported values are:
> >
> >       AT_EMPTY_PATH
> >              If path is the empty string, change the mount
> >              properties on dirfd itself.
> >
> >       AT_RECURSIVE
> >              Change the mount properties of the entire mount tree.
> >
> >       AT_SYMLINK_NOFOLLOW
> >              Don't follow trailing symbolic links.
> >
> >       AT_NO_AUTOMOUNT
> >              Don't trigger automounts.
> >
> >       The attr argument of mount_setattr() is a structure of the
> >       following form:
> >
> >           struct mount_attr {
> >               __u64 attr_set;     /* Mount properties to set */
> >               __u64 attr_clr;     /* Mount properties to clear */
> >               __u64 propagation;  /* Mount propagation type */
> >               __u64 userns_fd;    /* User namespace file descriptor */
> >           };
> >
> >       The attr_set and attr_clr members are used to specify the
> >       mount properties that are supposed to be set or cleared for a
> >       mount or mount tree.  Flags set in attr_set enable a property
> >       on a mount or mount tree, and flags set in attr_clr remove a
> >       property from a mount or mount tree.
> >
> >       When changing mount properties, the kernel will first clear
> >       the flags specified in the attr_clr field, and then set the
> >       flags specified in the attr_set field:
> 
> ???
> I find the following example a bit confusing. See below.
> 
> >
> >           struct mount_attr attr = {
> >               .attr_clr = MOUNT_ATTR_NOEXEC | MOUNT_ATTR_NODEV,
> >               .attr_set = MOUNT_ATTR_RDONLY | MOUNT_ATTR_NOSUID,
> >           };
> 
> ???
> I *think* that what you are trying to show is that the above initializer
> resuts in the equivalent of the following code. Is that correct? If so, 
> I think the text needs some work to make this clearer. Let me know.

Yes, exactly. Feel free to remove that code and just explain it in text
if that's better.

> 
> >           unsigned int current_mnt_flags = mnt->mnt_flags;
> >
> >           /*
> >            * Clear all flags set in .attr_clr,
> >            * clearing MOUNT_ATTR_NOEXEC and MOUNT_ATTR_NODEV.
> >            */
> >           current_mnt_flags &= ~attr->attr_clr;
> >
> >           /*
> >            * Now set all flags set in .attr_set,
> >            * applying MOUNT_ATTR_RDONLY and MOUNT_ATTR_NOSUID.
> >            */
> >           current_mnt_flags |= attr->attr_set;
> >
> >           mnt->mnt_flags = current_mnt_flags;
> >
> >       As a rsult of this change, the mount or mount tree (a) is

Typo: s/rsult/result/g

> >       read-only; (b) blocks the execution of set-user-ID and set-
> >       group-ID programs; (c) allows execution of programs; and (d)
> >       allows access to devices.
> >
> >       Multiple changes with the same set of flags requested in
> >       attr_clr and attr_set are guaranteed to be idempotent after
> >       the changes have been applied.
> >
> >       The following mount attributes can be specified in the
> >       attr_set or attr_clr fields:
> >
> >       MOUNT_ATTR_RDONLY
> >              If set in attr_set, makes the mount read-only.  If set
> >              in attr_clr, removes the read-only setting if set on
> >              the mount.
> >
> >       MOUNT_ATTR_NOSUID
> >              If set in attr_set, causes the mount not to honor the
> >              set-user-ID and set-group-ID mode bits and file
> >              capabilities when executing programs.  If set in
> >              attr_clr, clears the set-user-ID, set-group-ID, and
> >              file capability restriction if set on this mount.
> >
> >       MOUNT_ATTR_NODEV
> >              If set in attr_set, prevents access to devices on this
> >              mount.  If set in attr_clr, removes the restriction
> >              that prevented accessing devices on this mount.
> >
> >       MOUNT_ATTR_NOEXEC
> >              If set in attr_set, prevents executing programs on
> >              this mount.  If set in attr_clr, removes the
> >              restriction that prevented executing programs on this
> >              mount.
> >
> >       MOUNT_ATTR_NOSYMFOLLOW
> >              If set in attr_set, prevents following symbolic links
> >              on this mount.  If set in attr_clr, removes the
> >              restriction that prevented following symbolic links on
> >              this mount.
> >
> >       MOUNT_ATTR_NODIRATIME
> >              If set in attr_set, prevents updating access time for
> >              directories on this mount.  If set in attr_clr,
> >              removes the restriction that prevented updating access
> >              time for directories.  Note that MOUNT_ATTR_NODIRATIME
> >              can be combined with other access-time settings and is
> >              implied by the noatime setting.  All other access-time
> >              settings are mutually exclusive.
> >
> >       MOUNT_ATTR__ATIME - changing access-time settings
> >              In the new mount API, the access-time values are an
> >              enum starting from 0.  Even though they are an enum
> >              (in contrast to the other mount flags such as
> >              MOUNT_ATTR_NOEXEC), they are nonetheless passed in
> >              attr_set and attr_clr for consistency with fsmount(2),
> >              which introduced this behavior.
> >
> >              Note that, since access times are an enum not a bit
> >              map, users wanting to transition to a different
> >              access-time setting cannot simply specify the access-
> >              time setting in attr_set but must also set
> >              MOUNT_ATTR__ATIME in the attr_clr field.  The kernel
> >              will verify that MOUNT_ATTR__ATIME isn't partially set
> >              in attr_clr, and that attr_set doesn't have any
> >              access-time bits set if MOUNT_ATTR__ATIME isn't set in
> >              attr_clr.
> >
> >              MOUNT_ATTR_RELATIME
> >                     When a file is accessed via this mount, update
> >                     the file's last access time (atime) only if the
> >                     current value of atime is less than or equal to
> >                     the file's last modification time (mtime) or
> >                     last status change time (ctime).
> >
> >                     To enable this access-time setting on a mount
> >                     or mount tree, MOUNT_ATTR_RELATIME must be set
> >                     in attr_set and MOUNT_ATTR__ATIME must be set
> >                     in the attr_clr field.
> >
> >              MOUNT_ATTR_NOATIME
> >                     Do not update access times for (all types of)
> >                     files on this mount.
> >
> >                     To enable this access-time setting on a mount
> >                     or mount tree, MOUNT_ATTR_NOATIME must be set
> >                     in attr_set and MOUNT_ATTR__ATIME must be set
> >                     in the attr_clr field.
> >
> >              MOUNT_ATTR_STRICTATIME
> >                     Always update the last access time (atime) when
> >                     files are accessed on this mount.
> >
> >                     To enable this access-time setting on a mount
> >                     or mount tree, MOUNT_ATTR_STRICTATIME must be
> >                     set in attr_set and MOUNT_ATTR__ATIME must be
> >                     set in the attr_clr field.
> >
> >       MOUNT_ATTR_IDMAP
> >              If set in attr_set, creates an ID-mapped mount.  The
> >              ID mapping is taken from the user namespace specified
> 
> In various places, you wrote "idmapping". "idmapped", etc. I've
> changed these to the more natural English "ID mapping" etc.

Sure.

> 
> >              in userns_fd and attached to the mount.
> >
> >              Since it is not supported to change the ID mapping of
> >              a mount after it has been ID mapped, it is invalid to
> >              specify MOUNT_ATTR_IDMAP in attr_clr.
> >
> >              For further details, see the subsection "ID-mapped
> >              mounts" under NOTES.
> >
> >       The propagation field is used to specify the propagation type
> >       of the mount or mount tree.  Mount propagation options are
> >       mutually exclusive; that is, the propagation values behave
> >       like an enum.  The supported mount propagation types are:
> >
> >       MS_PRIVATE
> >              Turn all mounts into private mounts.  Mount and
> >              unmount events do not propagate into or out of this
> >              mount point.
> >
> >       MS_SHARED
> >              Turn all mounts into shared mounts.  Mount points
> >              share events with members of a peer group.  Mount and
> >              unmount events immediately under this mount point will
> >              propagate to the other mount points that are members
> >              of the peer group.  Propagation here means that the
> >              same mount or unmount will automatically occur under
> >              all of the other mount points in the peer group.
> >              Conversely, mount and unmount events that take place
> >              under peer mount points will propagate to this mount
> >              point.
> >
> >       MS_SLAVE
> >              Turn all mounts into dependent mounts.  Mount and
> >              unmount events propagate into this mount point from a
> >              shared peer group.  Mount and unmount events under
> >              this mount point do not propagate to any peer.
> >
> >       MS_UNBINDABLE
> >              This is like a private mount, and in addition this
> >              mount can't be bind mounted.  Attempts to bind mount
> >              this mount will fail.  When a recursive bind mount is
> >              performed on a directory subtree, any bind mounts
> >              within the subtree are automatically pruned (i.e., not
> >              replicated) when replicating that subtree to produce
> >              the target subtree.
> >
> >       For further details on propagation types, see
> >       mount_namespaces(7).
> >
> >   RETURN VALUE
> >       On success, mount_setattr() returns zero.  On error, -1 is
> >       returned and errno is set to indicate the cause of the error.
> >
> >   ERRORS
> >       EBADF  dirfd is not a valid file descriptor.
> >
> >       EBADF  userns_fd is not a valid file descriptor.
> >
> >       EBUSY  The caller tried to change the mount to
> >              MOUNT_ATTR_RDONLY, but the mount still holds files
> >              open for writing.
> >
> >       EINVAL The path specified via the dirfd and path arguments to
> >              mount_setattr() isn't a mount point.
> >
> >       EINVAL An unsupported value was set in flags.
> >
> >       EINVAL An unsupported value was specified in the attr_set
> >              field of mount_attr.
> >
> >       EINVAL An unsupported value was specified in the attr_clr
> >              field of mount_attr.
> >
> >       EINVAL An unsupported value was specified in the propagation
> >              field of mount_attr.
> >
> >       EINVAL More than one of MS_SHARED, MS_SLAVE, MS_PRIVATE, or
> >              MS_UNBINDABLE was set in the the propagation field of
> >              mount_attr.
> >
> >       EINVAL An access-time setting was specified in the attr_set
> >              field without MOUNT_ATTR__ATIME being set in the
> >              attr_clr field.
> >
> >       EINVAL MOUNT_ATTR_IDMAP was specified in attr_clr.
> >
> >       EINVAL A file descriptor value was specified in userns_fd
> >              which exceeds INT_MAX.
> >
> >       EINVAL A valid file descriptor value was specified in
> >              userns_fd, but the file descriptor wasn't a namespace
> >              file descriptor or did not refer to a user namespace.
> 
> ???
> Could the above not be simplified to
> 
>       EINVAL A valid file descriptor value was specified in
>              userns_fd, but the file descriptor did not refer
>              to a user namespace.

Sounds good.

> ?
> 
> >
> >       EINVAL The underlying filesystem does not support ID-mapped
> >              mounts.
> >
> >       EINVAL The mount that is to be ID mapped is not a
> >              detached/anonymous mount; that is, the mount is
> 
> ???
> What is a the distinction between "detached" and "anonymous"?
> Or do you mean them to be synonymous? If so, then let's use
> just one term, and I think "detached" is preferable.

Yes, they are synonymous here. I list both because detached can
potentially be confusing. A detached mount is a mount that has not been
visible in the filesystem. But if you attached it an then unmount it
right after and keep the fd for the mountpoint open it's a detached
mount purely on a natural language level, I'd argue. But it's not a
detached mount from the kernel's view anymore because it has been
exposed in the filesystem and is thus not detached anymore.
But I do prefer "detached" to "anonymous" and that confusion is very
unlikely to occur.

> 
> >              already visible in the filesystem.
> >
> >       EINVAL A partial access-time setting was specified in
> >              attr_clr instead of MOUNT_ATTR__ATIME being set.
> >
> >       EINVAL The mount is located outside the caller's mount
> >              namespace.
> >
> >       EINVAL The underlying filesystem is mounted in a user
> >              namespace.
> >
> >       ENOENT A pathname was empty or had a nonexistent component.
> >
> >       ENOMEM When changing mount propagation to MS_SHARED, a new
> >              peer group ID needs to be allocated for all mounts
> >              without a peer group ID set.  Allocation of this peer
> >              group ID has failed.
> >
> >       ENOSPC When changing mount propagation to MS_SHARED, a new
> >              peer group ID needs to be allocated for all mounts
> >              without a peer group ID set.  Allocation of this peer
> >              group ID can fail.  Note that technically further
> >              error codes are possible that are specific to the ID
> >              allocation implementation used.
> >
> >       EPERM  One of the mounts had at least one of
> >              MOUNT_ATTR_NOATIME, MOUNT_ATTR_NODEV,
> >              MOUNT_ATTR_NODIRATIME, MOUNT_ATTR_NOEXEC,
> >              MOUNT_ATTR_NOSUID, or MOUNT_ATTR_RDONLY set and the
> >              flag is locked.  Mount attributes become locked on a
> >              mount if:
> >
> >              •  A new mount or mount tree is created causing mount
> >                 propagation across user namespaces.  The kernel
> >                 will lock the aforementioned flags to protect these
> >                 sensitive properties from being altered.
> >
> >              •  A new mount and user namespace pair is created.
> >                 This happens for example when specifying
> >                 CLONE_NEWUSER | CLONE_NEWNS in unshare(2),
> >                 clone(2), or clone3(2).  The aforementioned flags
> >                 become locked to protect user namespaces from
> >                 altering sensitive mount properties.
> >
> >       EPERM  A valid file descriptor value was specified in
> >              userns_fd, but the file descriptor refers to the
> >              initial user namespace.
> >
> >       EPERM  An already ID-mapped mount was supposed to be ID
> >              mapped.
> 
> ???
> Better:
>     An attempt was made to add an ID mapping to a mount that is already
>     ID mapped.

Sounds good.

> ?
> 
> >
> >       EPERM  The caller does not have CAP_SYS_ADMIN in the initial
> >              user namespace.
> >
> >   VERSIONS
> >       mount_setattr() first appeared in Linux 5.12.
> >
> >   CONFORMING TO
> >       mount_setattr() is Linux-specific.
> >
> >   NOTES
> >   ID-mapped mounts
> >       Creating an ID-mapped mount makes it possible to change the
> >       ownership of all files located under a mount.  Thus, ID-
> >       mapped mounts make it possible to change ownership in a
> >       temporary and localized way.  It is a localized change
> >       because ownership changes are restricted to a specific mount.
> 
> ???
> Would it be clearer to say something like:
> 
>     It is a localized change because ownership changes are
>     visible only via a specific mount.
> ?

Sounds good.

> 
> 
> >       All other users and locations where the filesystem is exposed
> >       are unaffected.  And it is a temporary change because
> >       ownership changes are tied to the lifetime of the mount.
> >
> >       Whenever callers interact with the filesystem through an ID-
> >       mapped mount, the ID mapping of the mount will be applied to
> >       user and group IDs associated with filesystem objects.  This
> >       encompasses the user and group IDs associated with inodes and
> >       also the following xattr(7) keys:
> >
> >       •  security.capability, whenever filesystem capabilities are
> >          stored or returned in the VFS_CAP_REVISION_3 format, which
> >          stores a root user ID alongside the capabilities (see
> >          capabilities(7)).
> >
> >       •  system.posix_acl_access and system.posix_acl_default,
> >          whenever user IDs or group IDs are stored in ACL_USER or
> >          ACL_GROUP entries.
> >
> >       The following conditions must be met in order to create an
> >       ID-mapped mount:
> >
> >       •  The caller must have the CAP_SYS_ADMIN capability in the
> >          initial user namespace.
> >
> >       •  The filesystem must be mounted in the initial user
> >          namespace.
> 
> ???
> Should this rather be written as:
>  
>      The filesystem must be mounted in a mount namespace 
>      that is owned by the initial user namespace.

Sounds good.

> 
> >       •  The underlying filesystem must support ID-mapped mounts.
> >          Currently, the xfs(5), ext4(5), and FAT filesystems
> >          support ID-mapped mounts with more filesystems being
> >          actively worked on.
> >
> >       •  The mount must not already be ID-mapped.  This also
> >          implies that the ID mapping of a mount cannot be altered.
> >
> >       •  The mount must be a detached/anonymous mount; that is, it
> 
> ???
> See the above questionon "detached" vs "anonymous"

Yes, please use "detached" only.

> 
> >          must have been created by calling open_tree(2) with the
> >          OPEN_TREE_CLONE flag and it must not already have been
> >          visible in the filesystem.
> >
> >       ID mappings can be created for user IDs, group IDs, and
> >       project IDs.  An ID mapping is essentially a mapping of a
> >       range of user or group IDs into another or the same range of
> >       user or group IDs.  ID mappings are usually written as three
> >       numbers either separated by white space or a full stop.  The
> >       first two numbers specify the starting user or group ID in
> >       each of the two user namespaces.  The third number specifies
> >       the range of the ID mapping.  For example, a mapping for user
> >       IDs such as 1000:1001:1 would indicate that user ID 1000 in
> >       the caller's user namespace is mapped to user ID 1001 in its
> >       ancestor user namespace.  Since the map range is 1, only user
> >       ID 1000 is mapped.
> 
> ???
> The details above seem wrong. When writing to map files, the
> fields must be white-space separated, AFAIK. But above you mention
> "full stops" and also show an example using colons (:). Those
> both seem wrong and confusing. Am I missing something?

This is more about notational conventions that exist and not about how
they are actually written. That's something I'm not touching on here as
it doesn't belong on this manpage. But feel free to only mention spaces.

> 
> >       It is possible to specify up to 340 ID mappings for each ID
> >       mapping type.  If any user IDs or group IDs are not mapped,
> >       all files owned by that unmapped user or group ID will appear
> >       as being owned by the overflow user ID or overflow group ID
> >       respectively.
> >
> >       Further details and instructions for setting up ID mappings
> >       can be found in the user_namespaces(7) man page.
> >
> >       In the common case, the user namespace passed in userns_fd
> >       together with MOUNT_ATTR_IDMAP in attr_set to create an ID-
> >       mapped mount will be the user namespace of a container.  In
> >       other scenarios it will be a dedicated user namespace
> >       associated with a user's login session as is the case for
> >       portable home directories in systemd-homed.service(8)).  It
> >       is also perfectly fine to create a dedicated user namespace
> >       for the sake of ID mapping a mount.
> >
> >       ID-mapped mounts can be useful in the following and a variety
> >       of other scenarios:
> >
> >       •  Sharing files between multiple users or multiple machines,
> 
> ???
> s/Sharing files/Sharing filesystems/ ?

[1]: But work. But feel free to use "sharing filesystems".

> 
> >          especially in complex scenarios.  For example, ID-mapped
> >          mounts are used to implement portable home directories in
> >          systemd-homed.service(8), where they allow users to move
> >          their home directory to an external storage device and use
> >          it on multiple computers where they are assigned different
> >          user IDs and group IDs.  This effectively makes it
> >          possible to assign random user IDs and group IDs at login
> >          time.
> >
> >       •  Sharing files from the host with unprivileged containers.
> 
> ???
> s/Sharing files/Sharing filesystems/ ?

See [1].

> 
> >          This allows a user to avoid having to change ownership
> >          permanently through chown(2).
> >
> >       •  ID mapping a container's root filesystem.  Users don't
> >          need to change ownership permanently through chown(2).
> >          Especially for large root filesystems, using chown(2) can
> >          be prohibitively expensive.
> >
> >       •  Sharing files between containers with non-overlapping ID
> 
> ???
> s/Sharing files/Sharing filesystems/ ?

See [1].

> 
> >          mappings.
> >
> >       •  Implementing discretionary access (DAC) permission
> >          checking for filesystems lacking a concept of ownership.
> >
> >       •  Efficiently changing ownership on a per-mount basis.  In
> >          contrast to chown(2), changing ownership of large sets of
> >          files is instantaneous with ID-mapped mounts.  This is
> >          especially useful when ownership of an entire root
> >          filesystem of a virtual machine or container is to be
> >          changed as mentioned above.  With ID-mapped mounts, a
> >          single mount_setattr() system call will be sufficient to
> >          change the ownership of all files.
> >
> >       •  Taking the current ownership into account.  ID mappings
> >          specify precisely what a user or group ID is supposed to
> >          be mapped to.  This contrasts with the chown(2) system
> >          call which cannot by itself take the current ownership of
> >          the files it changes into account.  It simply changes the
> >          ownership to the specified user ID and group ID.
> >
> >       •  Locally and temporarily restricted ownership changes.  ID-
> >          mapped mounts make it possible to change ownership
> >          locally, restricting it to specific mounts, and
> 
> ???
> The referent of "it" in the preceding line is not clear.
> Should it be "the ownership changes"? Or something else?

It should refer to ownership changes. I'd appreciate it if you could
reformulate.

> 
> >          temporarily as the ownership changes only apply as long as
> >          the mount exists.  By contrast, changing ownership via the
> >          chown(2) system call changes the ownership globally and
> >          permanently.
> >
> >   Extensibility
> >       In order to allow for future extensibility, mount_setattr()
> >       requires the user-space application to specify the size of
> >       the mount_attr structure that it is passing.  By providing
> >       this information, it is possible for mount_setattr() to
> >       provide both forwards- and backwards-compatibility, with size
> >       acting as an implicit version number.  (Because new extension
> >       fields will always be appended, the structure size will
> >       always increase.)  This extensibility design is very similar
> >       to other system calls such as perf_setattr(2),
> >       perf_event_open(2), clone3(2) and openat2(2).
> >
> >       Let usize be the size of the structure as specified by the
> >       user-space application, and let ksize be the size of the
> >       structure which the kernel supports, then there are three
> >       cases to consider:
> >
> >       •  If ksize equals usize, then there is no version mismatch
> >          and attr can be used verbatim.
> >
> >       •  If ksize is larger than usize, then there are some
> >          extension fields that the kernel supports which the user-
> >          space application is unaware of.  Because a zero value in
> >          any added extension field signifies a no-op, the kernel
> >          treats all of the extension fields not provided by the
> >          user-space application as having zero values.  This
> >          provides backwards-compatibility.
> >
> >       •  If ksize is smaller than usize, then there are some
> >          extension fields which the user-space application is aware
> >          of but which the kernel does not support.  Because any
> >          extension field must have its zero values signify a no-op,
> >          the kernel can safely ignore the unsupported extension
> >          fields if they are all zero.  If any unsupported extension
> >          fields are non-zero, then -1 is returned and errno is set
> >          to E2BIG.  This provides forwards-compatibility.
> >
> >       Because the definition of struct mount_attr may change in the
> >       future (with new fields being added when system headers are
> >       updated), user-space applications should zero-fill struct
> >       mount_attr to ensure that recompiling the program with new
> >       headers will not result in spurious errors at runtime.  The
> >       simplest way is to use a designated initializer:
> >
> >           struct mount_attr attr = {
> >               .attr_set = MOUNT_ATTR_RDONLY,
> >               .attr_clr = MOUNT_ATTR_NODEV
> >           };
> >
> >       Alternatively, the structure can be zero-filled using
> >       memset(3) or similar functions:
> >
> >           struct mount_attr attr;
> >           memset(&attr, 0, sizeof(attr));
> >           attr.attr_set = MOUNT_ATTR_RDONLY;
> >           attr.attr_clr = MOUNT_ATTR_NODEV;
> >
> >       A user-space application that wishes to determine which
> >       extensions the running kernel supports can do so by
> >       conducting a binary search on size with a structure which has
> >       every byte nonzero (to find the largest value which doesn't
> >       produce an error of E2BIG).
> >
> >   EXAMPLES
> 
> ???
> Do you have a (preferably simple) example piece of code
> somewhere for setting up an ID mapped mount?


> 
> >       /*
> >        * This program allows the caller to create a new detached mount
> >        * and set various properties on it.
> >        */
> >       #define _GNU_SOURCE
> >       #include <errno.h>
> >       #include <fcntl.h>
> >       #include <getopt.h>
> >       #include <linux/mount.h>
> >       #include <linux/types.h>
> >       #include <stdbool.h>
> >       #include <stdio.h>
> >       #include <stdlib.h>
> >       #include <string.h>
> >       #include <sys/syscall.h>
> >       #include <unistd.h>
> >
> >       static inline int
> >       mount_setattr(int dirfd, const char *path, unsigned int flags,
> >                     struct mount_attr *attr, size_t size)
> >       {
> >           return syscall(SYS_mount_setattr, dirfd, path, flags,
> >                          attr, size);
> >       }
> >
> >       static inline int
> >       open_tree(int dirfd, const char *filename, unsigned int flags)
> >       {
> >           return syscall(SYS_open_tree, dirfd, filename, flags);
> >       }
> >
> >       static inline int
> >       move_mount(int from_dirfd, const char *from_pathname,
> >                  int to_dirfd, const char *to_pathname,
> >                  unsigned int flags)
> >       {
> >           return syscall(SYS_move_mount, from_dirfd, from_pathname,
> >                          to_dirfd, to_pathname, flags);
> >       }
> >
> >       static const struct option longopts[] = {
> >           {"map-mount",       required_argument,  NULL,  'a'},
> >           {"recursive",       no_argument,        NULL,  'b'},
> >           {"read-only",       no_argument,        NULL,  'c'},
> >           {"block-setid",     no_argument,        NULL,  'd'},
> >           {"block-devices",   no_argument,        NULL,  'e'},
> >           {"block-exec",      no_argument,        NULL,  'f'},
> >           {"no-access-time",  no_argument,        NULL,  'g'},
> >           { NULL,             0,                  NULL,   0 },
> >       };
> >
> >       #define exit_log(format, ...)  do           \
> >       {                                           \
> >           fprintf(stderr, format, ##__VA_ARGS__); \
> >           exit(EXIT_FAILURE);                     \
> >       } while (0)
> >
> >       int
> >       main(int argc, char *argv[])
> >       {
> >           struct mount_attr *attr = &(struct mount_attr){};
> >           int fd_userns = -EBADF;
> 
> ???
> Why this magic initializer here? Why not just "-1"?
> Using -EBADF makes it look this is value specifically is
> meaningful, although I don't think that's true.

[2]: I always use -EBADF to initialize fds in all my code. It makes it
pretty easy to grep for fd initialization etc. So it's pure visual
convenience. Freel free to just use -1.

> 
> >           bool recursive = false;
> >           int index = 0;
> >           int ret;
> >
> >           while ((ret = getopt_long_only(argc, argv, "",
> >                                          longopts, &index)) != -1) {
> >               switch (ret) {
> >               case 'a':
> >                   fd_userns = open(optarg, O_RDONLY | O_CLOEXEC);
> >                   if (fd_userns == -1)
> >                       exit_log("%m - Failed top open %s\n", optarg);
> >                   break;
> >               case 'b':
> >                   recursive = true;
> >                   break;
> >               case 'c':
> >                   attr->attr_set |= MOUNT_ATTR_RDONLY;
> >                   break;
> >               case 'd':
> >                   attr->attr_set |= MOUNT_ATTR_NOSUID;
> >                   break;
> >               case 'e':
> >                   attr->attr_set |= MOUNT_ATTR_NODEV;
> >                   break;
> >               case 'f':
> >                   attr->attr_set |= MOUNT_ATTR_NOEXEC;
> >                   break;
> >               case 'g':
> >                   attr->attr_set |= MOUNT_ATTR_NOATIME;
> >                   attr->attr_clr |= MOUNT_ATTR__ATIME;
> >                   break;
> >               default:
> >                   exit_log("Invalid argument specified");
> >               }
> >           }
> >
> >           if ((argc - optind) < 2)
> >               exit_log("Missing source or target mount point\n");
> >
> >           const char *source = argv[optind];
> >           const char *target = argv[optind + 1];
> >
> >           int fd_tree = open_tree(-EBADF, source,
> >                        OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC |
> >                        AT_EMPTY_PATH | (recursive ? AT_RECURSIVE : 0));
> 
> ???
> What is the significance of -EBADF here? As far as I can tell, it
> is not meaningful to open_tree()?

I always pass -EBADF for similar reasons to [2]. Feel free to just use -1.

> 
> 
> >           if (fd_tree == -1)
> >               exit_log("%m - Failed to open %s\n", source);
> >
> >           if (fd_userns >= 0) {
> >               attr->attr_set  |= MOUNT_ATTR_IDMAP;
> >               attr->userns_fd = fd_userns;
> >           }
> >
> >           ret = mount_setattr(fd_tree, "",
> >                       AT_EMPTY_PATH | (recursive ? AT_RECURSIVE : 0),
> >                       attr, sizeof(struct mount_attr));
> >           if (ret == -1)
> >               exit_log("%m - Failed to change mount attributes\n");
> >
> >           close(fd_userns);
> >
> >           ret = move_mount(fd_tree, "", -EBADF, target,
> >                            MOVE_MOUNT_F_EMPTY_PATH);
> 
> ???
> What is the significance of -EBADF here? As far as I can tell, it
> is not meaningful to move_mount()?

See [2].

> 
> >           if (ret == -1)
> >               exit_log("%m - Failed to attach mount to %s\n", target);
> >
> >           close(fd_tree);
> >
> >           exit(EXIT_SUCCESS);
> >       }
> >
> >   SEE ALSO
> >       newuidmap(1), newgidmap(1), clone(2), mount(2), unshare(2),
> >       proc(5), mount_namespaces(7), capabilities(7),
> >       user_namespaces(7), xattr(7)
> 
> Thanks,
> 
> Michael

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Questions re the new mount_setattr(2) manual page
  2021-08-10 14:11   ` Christian Brauner
@ 2021-08-10 19:30     ` Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 15+ messages in thread
From: Michael Kerrisk (man-pages) @ 2021-08-10 19:30 UTC (permalink / raw)
  To: Christian Brauner
  Cc: mtk.manpages, Alejandro Colomar, linux-fsdevel, lkml, linux-man,
	Christoph Hellwig

On 8/10/21 4:11 PM, Christian Brauner wrote:
> On Tue, Aug 10, 2021 at 09:12:14AM +0200, Michael Kerrisk (man-pages) wrote:
>> Hi Christian,
>>
>> One more question...
>>
>>>>       The propagation field is used to specify the propagation typ
>>>>       of the mount or mount tree.  Mount propagation options are
>>>>       mutually exclusive; that is, the propagation values behave
>>>>       like an enum.  The supported mount propagation types are:
>>
>> The manual page text doesn't actually say it, but if the 'propagation'
>> field is 0, then this means leave the propagation type unchanged, 
>> right? This of course should be mentioned in the manual page.
> 
> Yes, if none of the documented values is set the propagation is unchanged.

Thanks for the confirmation.

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Questions re the new mount_setattr(2) manual page
  2021-08-10 14:32 ` Christian Brauner
@ 2021-08-10 21:06   ` Michael Kerrisk (man-pages)
  2021-08-11 10:07     ` Christian Brauner
  0 siblings, 1 reply; 15+ messages in thread
From: Michael Kerrisk (man-pages) @ 2021-08-10 21:06 UTC (permalink / raw)
  To: Christian Brauner
  Cc: mtk.manpages, Alejandro Colomar, linux-fsdevel, lkml, linux-man,
	Christoph Hellwig

Hello Christian,

On 8/10/21 4:32 PM, Christian Brauner wrote:
> On Tue, Aug 10, 2021 at 03:38:00AM +0200, Michael Kerrisk (man-pages) wrote:
>> Hi Christian,
>>
>> Thanks for the very nice manual page that you wrote. I have
> 
> Thank you!
> 
>> made a large number of (mostly trivial) edits. If you could
>> read the page closely, to check that I introduced no errors,
>> I would appreciate it.
> 
> Happy to!

Thanks for the feedback. I've made some changes, and pushed to Git.

There's still a few open questions. Please see "????" below.

>> I have various questions below, marked ???. Could you please take
>> a look at these, and I will then make further edits based on your
>> answers.
> 
> I've answered all questions, I think. Feel free to just reformulate
> where my suggestions weren't adequate. Since most things you ask about
> are minor adaptions there's no need from my end for you to resend with
> those reformulations. You can just make them directly. :) I'll peruse
> the man-pages git repo anyway after you apply them and will send changes
> if I spot issues.
> 
> Thank you for the review!
> Christian
> 
>>
>> The current version of the page is already pushed to the man-pages
>> Git repo.
>>
>>>   MOUNT_SETATTR(2)      Linux Programmer's Manual     MOUNT_SETATTR(2)
>>>
>>>   NAME
>>>       mount_setattr - change mount properties of a mount or mount
>>
>> ???
>> s/mount properties/properties ?
>>
>> (Just bcause more concise.)
> 
> Sounds good.

Done.

>>
>>>       tree
>>>
>>>   SYNOPSIS
>>>       #include <linux/fcntl.h> /* Definition of AT_* constants */
>>>       #include <linux/mount.h> /* Definition of MOUNT_ATTR_* constants */
>>>       #include <sys/syscall.h> /* Definition of SYS_* constants */
>>>       #include <unistd.h>
>>>
>>>       int syscall(SYS_mount_setattr, int dirfd, const char *path,
>>>               unsigned int flags, struct mount_attr *attr, size_t size);
>>>
>>>       Note: glibc provides no wrapper for mount_setattr(),
>>>       necessitating the use of syscall(2).
>>>
>>>   DESCRIPTION

[...]

>>>       The size argument should usually be specified as
>>>       sizeof(struct mount_attr).  However, if the caller does not
>>>       intend to make use of features that got introduced after the
>>>       initial version of struct mount_attr, it is possible to pass
>>>       the size of the initial struct together with the larger
>>>       struct.  This allows the kernel to not copy later parts of
>>>       the struct that aren't used anyway.  With each extension that
>>>       changes the size of struct mount_attr, the kernel will expose
>>>       a definition of the form MOUNT_ATTR_SIZE_VERnumber.  For
>>>       example, the macro for the size of the initial version of
>>>       struct mount_attr is MOUNT_ATTR_SIZE_VER0.
>>
>> ???
>> I think I understand the above paragraph, but I wonder if it could
>> be improved a little. The general principle is that one can always
>> pass the size of an earlier, smaller structure to the kernel, right?
> 
> Yes.
> 
>> My point is that it need not be the size of the initial structure,
>> right? So, I wonder whether a little rewording might be need above.
> 
> Yes, the initial structure size is just an example because that will be
> very common.
> 
>> What do you think?
> 
> Sure, I'm proposing something here but please, fell free to reformulate
> or come up with something completely new:
> 
> 	[...]
> 	However, if the caller is using a kernel that supports an
> 	extended struct mount_attr but the caller does not intend to
> 	make use of these features they can pass the size of an earlier
> 	version of the struct together with the extended structure.
> 	[...]

Perfect! I took that text pretty much exactly as you gave it.

[...]

>>>       The attr_set and attr_clr members are used to specify the
>>>       mount properties that are supposed to be set or cleared for a
>>>       mount or mount tree.  Flags set in attr_set enable a property
>>>       on a mount or mount tree, and flags set in attr_clr remove a
>>>       property from a mount or mount tree.
>>>
>>>       When changing mount properties, the kernel will first clear
>>>       the flags specified in the attr_clr field, and then set the
>>>       flags specified in the attr_set field:
>>
>> ???
>> I find the following example a bit confusing. See below.
>>
>>>
>>>           struct mount_attr attr = {
>>>               .attr_clr = MOUNT_ATTR_NOEXEC | MOUNT_ATTR_NODEV,
>>>               .attr_set = MOUNT_ATTR_RDONLY | MOUNT_ATTR_NOSUID,
>>>           };
>>
>> ???
>> I *think* that what you are trying to show is that the above initializer
>> resuts in the equivalent of the following code. Is that correct? If so, 
>> I think the text needs some work to make this clearer. Let me know.
> 
> Yes, exactly. Feel free to remove that code and just explain it in text
> if that's better.

I've done some rewording to say that the code snippet shows
the effect of the initializer.

[...]

>>>   RETURN VALUE
>>>       On success, mount_setattr() returns zero.  On error, -1 is
>>>       returned and errno is set to indicate the cause of the error.
>>>
>>>   ERRORS

[...]

>>>       EINVAL A valid file descriptor value was specified in
>>>              userns_fd, but the file descriptor wasn't a namespace
>>>              file descriptor or did not refer to a user namespace.
>>
>> ???
>> Could the above not be simplified to
>>
>>       EINVAL A valid file descriptor value was specified in
>>              userns_fd, but the file descriptor did not refer
>>              to a user namespace.
> 
> Sounds good.
> 
>> ?

Done.

>>>
>>>       EINVAL The underlying filesystem does not support ID-mapped
>>>              mounts.
>>>
>>>       EINVAL The mount that is to be ID mapped is not a
>>>              detached/anonymous mount; that is, the mount is
>>
>> ???
>> What is a the distinction between "detached" and "anonymous"?
>> Or do you mean them to be synonymous? If so, then let's use
>> just one term, and I think "detached" is preferable.
> 
> Yes, they are synonymous here. I list both because detached can
> potentially be confusing. A detached mount is a mount that has not been
> visible in the filesystem. But if you attached it an then unmount it
> right after and keep the fd for the mountpoint open it's a detached
> mount purely on a natural language level, I'd argue. But it's not a
> detached mount from the kernel's view anymore because it has been
> exposed in the filesystem and is thus not detached anymore.
> But I do prefer "detached" to "anonymous" and that confusion is very
> unlikely to occur.

Thanks. I made it "detached". Elsewhere, the page already explains
that a detached mount is one that:

          must have been created by calling open_tree(2) with the
          OPEN_TREE_CLONE flag and it must not already have been
          visible in the filesystem.

Which seems a fine explanation. 

????
But, just a thought... "visible in the filesystem" seems not quite accurate. 
What you really mean I guess is that it must not already have been
/visible in the filesystem hierarchy/previously mounted/something else/,
right?

>>>              already visible in the filesystem.
>>>

[...]

>>>       EPERM  An already ID-mapped mount was supposed to be ID
>>>              mapped.
>>
>> ???
>> Better:
>>     An attempt was made to add an ID mapping to a mount that is already
>>     ID mapped.
> 
> Sounds good.
> 
>> ?

Done.

[...]

>>>   NOTES
>>>   ID-mapped mounts
>>>       Creating an ID-mapped mount makes it possible to change the
>>>       ownership of all files located under a mount.  Thus, ID-
>>>       mapped mounts make it possible to change ownership in a
>>>       temporary and localized way.  It is a localized change
>>>       because ownership changes are restricted to a specific mount.
>>
>> ???
>> Would it be clearer to say something like:
>>
>>     It is a localized change because ownership changes are
>>     visible only via a specific mount.
>> ?
> 
> Sounds good.

Done.

[...]

>>>       The following conditions must be met in order to create an
>>>       ID-mapped mount:
>>>
>>>       •  The caller must have the CAP_SYS_ADMIN capability in the
>>>          initial user namespace.
>>>
>>>       •  The filesystem must be mounted in the initial user
>>>          namespace.
>>
>> ???
>> Should this rather be written as:
>>  
>>      The filesystem must be mounted in a mount namespace 
>>      that is owned by the initial user namespace.
> 
> Sounds good.

Done.

>>>       •  The underlying filesystem must support ID-mapped mounts.
>>>          Currently, the xfs(5), ext4(5), and FAT filesystems
>>>          support ID-mapped mounts with more filesystems being
>>>          actively worked on.
>>>
>>>       •  The mount must not already be ID-mapped.  This also
>>>          implies that the ID mapping of a mount cannot be altered.
>>>
>>>       •  The mount must be a detached/anonymous mount; that is, it
>>
>> ???
>> See the above questionon "detached" vs "anonymous"
> 
> Yes, please use "detached" only.

Done.

>>>          must have been created by calling open_tree(2) with the
>>>          OPEN_TREE_CLONE flag and it must not already have been
>>>          visible in the filesystem.
>>>
>>>       ID mappings can be created for user IDs, group IDs, and
>>>       project IDs.  An ID mapping is essentially a mapping of a
>>>       range of user or group IDs into another or the same range of
>>>       user or group IDs.  ID mappings are usually written as three
>>>       numbers either separated by white space or a full stop.  The
>>>       first two numbers specify the starting user or group ID in
>>>       each of the two user namespaces.  The third number specifies
>>>       the range of the ID mapping.  For example, a mapping for user
>>>       IDs such as 1000:1001:1 would indicate that user ID 1000 in
>>>       the caller's user namespace is mapped to user ID 1001 in its
>>>       ancestor user namespace.  Since the map range is 1, only user
>>>       ID 1000 is mapped.
>>
>> ???
>> The details above seem wrong. When writing to map files, the
>> fields must be white-space separated, AFAIK. But above you mention
>> "full stops" and also show an example using colons (:). Those
>> both seem wrong and confusing. Am I missing something?
> 
> This is more about notational conventions that exist and not about how
> they are actually written. That's something I'm not touching on here as
> it doesn't belong on this manpage. But feel free to only mention spaces.

Thanks for the explanation. In this context though, this could mislead
the reader, so I've removed mention of "full stop" and ":".

>>>       It is possible to specify up to 340 ID mappings for each ID
>>>       mapping type.  If any user IDs or group IDs are not mapped,
>>>       all files owned by that unmapped user or group ID will appear
>>>       as being owned by the overflow user ID or overflow group ID
>>>       respectively.
>>>
>>>       Further details and instructions for setting up ID mappings
>>>       can be found in the user_namespaces(7) man page.
>>>
>>>       In the common case, the user namespace passed in userns_fd
>>>       together with MOUNT_ATTR_IDMAP in attr_set to create an ID-
>>>       mapped mount will be the user namespace of a container.  In
>>>       other scenarios it will be a dedicated user namespace
>>>       associated with a user's login session as is the case for
>>>       portable home directories in systemd-homed.service(8)).  It
>>>       is also perfectly fine to create a dedicated user namespace
>>>       for the sake of ID mapping a mount.

I forgot to mention it earlier, but the following text on the
rationale for ID-mapped mounts is what turns this from a good 
manual page into a great manual page. Thank you for including it.

>>>       ID-mapped mounts can be useful in the following and a variety
>>>       of other scenarios:
>>>
>>>       •  Sharing files between multiple users or multiple machines,
>>
>> ???
>> s/Sharing files/Sharing filesystems/ ?
> 
> [1]: But work. But feel free to use "sharing filesystems".

s/But/Both/

I made it "Sharing files or filesystsms"

>>
>>>          especially in complex scenarios.  For example, ID-mapped
>>>          mounts are used to implement portable home directories in
>>>          systemd-homed.service(8), where they allow users to move
>>>          their home directory to an external storage device and use
>>>          it on multiple computers where they are assigned different
>>>          user IDs and group IDs.  This effectively makes it
>>>          possible to assign random user IDs and group IDs at login
>>>          time.
>>>
>>>       •  Sharing files from the host with unprivileged containers.
>>
>> ???
>> s/Sharing files/Sharing filesystems/ ?
> 
> See [1].

Same.

>>>          This allows a user to avoid having to change ownership
>>>          permanently through chown(2).
>>>
>>>       •  ID mapping a container's root filesystem.  Users don't
>>>          need to change ownership permanently through chown(2).
>>>          Especially for large root filesystems, using chown(2) can
>>>          be prohibitively expensive.
>>>
>>>       •  Sharing files between containers with non-overlapping ID
>>
>> ???
>> s/Sharing files/Sharing filesystems/ ?
> 
> See [1].

Same.

[...]

>>>       •  Locally and temporarily restricted ownership changes.  ID-
>>>          mapped mounts make it possible to change ownership
>>>          locally, restricting it to specific mounts, and
>>
>> ???
>> The referent of "it" in the preceding line is not clear.
>> Should it be "the ownership changes"? Or something else?
> 
> It should refer to ownership changes. I'd appreciate it if you could
> reformulate.

Done.

>>>          temporarily as the ownership changes only apply as long as
>>>          the mount exists.  By contrast, changing ownership via the
>>>          chown(2) system call changes the ownership globally and
>>>          permanently.
>>>
>>>   Extensibility

[...]

>>>   EXAMPLES
>>
>> ???
>> Do you have a (preferably simple) example piece of code
>> somewhere for setting up an ID mapped mount?

????
I guess the best example is this:
https://github.com/brauner/mount-idmapped/
right?

[...]

>>>       int
>>>       main(int argc, char *argv[])
>>>       {
>>>           struct mount_attr *attr = &(struct mount_attr){};
>>>           int fd_userns = -EBADF;
>>
>> ???
>> Why this magic initializer here? Why not just "-1"?
>> Using -EBADF makes it look this is value specifically is
>> meaningful, although I don't think that's true.
> 
> [2]: I always use -EBADF to initialize fds in all my code. It makes it
> pretty easy to grep for fd initialization etc. So it's pure visual
> convenience. Freel free to just use -1.

Changed.

[...]

>>>           int fd_tree = open_tree(-EBADF, source,
>>>                        OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC |
>>>                        AT_EMPTY_PATH | (recursive ? AT_RECURSIVE : 0));
>>
>> ???
>> What is the significance of -EBADF here? As far as I can tell, it
>> is not meaningful to open_tree()?
> 
> I always pass -EBADF for similar reasons to [2]. Feel free to just use -1.

????
But here, both -EBADF and -1 seem to be wrong. This argument 
is a dirfd, and so should either be a file descriptor or the
value AT_FDCWD, right?

>>>           if (fd_tree == -1)
>>>               exit_log("%m - Failed to open %s\n", source);
>>>
>>>           if (fd_userns >= 0) {
>>>               attr->attr_set  |= MOUNT_ATTR_IDMAP;
>>>               attr->userns_fd = fd_userns;
>>>           }
>>>
>>>           ret = mount_setattr(fd_tree, "",
>>>                       AT_EMPTY_PATH | (recursive ? AT_RECURSIVE : 0),
>>>                       attr, sizeof(struct mount_attr));
>>>           if (ret == -1)
>>>               exit_log("%m - Failed to change mount attributes\n");
>>>
>>>           close(fd_userns);
>>>
>>>           ret = move_mount(fd_tree, "", -EBADF, target,
>>>                            MOVE_MOUNT_F_EMPTY_PATH);
>>
>> ???
>> What is the significance of -EBADF here? As far as I can tell, it
>> is not meaningful to move_mount()?
> 
> See [2].

????
As above, both -EBADF and -1 seem to be wrong. This argument 
is a dirfd, and so should either be a file descriptor or the
value AT_FDCWD, right?

[...]

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Questions re the new mount_setattr(2) manual page
  2021-08-10  1:38 Questions re the new mount_setattr(2) manual page Michael Kerrisk (man-pages)
  2021-08-10  7:12 ` Michael Kerrisk (man-pages)
  2021-08-10 14:32 ` Christian Brauner
@ 2021-08-10 22:47 ` Michael Kerrisk (man-pages)
  2021-08-11 10:40   ` Christian Brauner
  2 siblings, 1 reply; 15+ messages in thread
From: Michael Kerrisk (man-pages) @ 2021-08-10 22:47 UTC (permalink / raw)
  To: Christian Brauner
  Cc: mtk.manpages, Alejandro Colomar, linux-fsdevel, lkml, linux-man,
	Christoph Hellwig

Hi Christian,

Some further questions...

In ERRORS there is:

       EINVAL The underlying filesystem is mounted in a user namespace.

I don't understand this. What does it mean?

Also, there is this:

       ENOMEM When  changing  mount  propagation to MS_SHARED, a new peer
              group ID needs to be allocated for  all  mounts  without  a
              peer  group  ID  set.  Allocation of this peer group ID has
              failed.

       ENOSPC When changing mount propagation to MS_SHARED,  a  new  peer
              group  ID  needs  to  be allocated for all mounts without a
              peer group ID set.  Allocation of this peer  group  ID  can
              fail.  Note that technically further error codes are possi‐
              ble that are specific to the ID  allocation  implementation
              used.

What is the difference between these two error cases? (That is, in what 
circumstances will one get ENOMEM vs ENOSPC and vice versa?)

And then:

       EPERM  One  of  the mounts had at least one of MOUNT_ATTR_NOATIME,
              MOUNT_ATTR_NODEV, MOUNT_ATTR_NODIRATIME, MOUNT_ATTR_NOEXEC,
              MOUNT_ATTR_NOSUID, or MOUNT_ATTR_RDONLY set and the flag is
              locked.  Mount attributes become locked on a mount if:

              •  A new mount or mount tree is created causing mount prop‐
                 agation  across  user  namespaces.  The kernel will lock

Propagation is done across mont points, not user namespaces.
should "across user namespaces" be "to a mount namespace owned 
by a different user namespace"? Or something else?

                 the aforementioned  flags  to  protect  these  sensitive
                 properties from being altered.

              •  A  new  mount  and user namespace pair is created.  This
                 happens for  example  when  specifying  CLONE_NEWUSER  |
                 CLONE_NEWNS  in unshare(2), clone(2), or clone3(2).  The
                 aforementioned flags become locked to protect user name‐
                 spaces from altering sensitive mount properties.

Again, this seems imprecise. Should it say something like:
"... to prevent changes to sensitive mount properties in the new 
mount namespace" ? Or perhaps you have a better wording.

Thanks,

Michael

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Questions re the new mount_setattr(2) manual page
  2021-08-10 21:06   ` Michael Kerrisk (man-pages)
@ 2021-08-11 10:07     ` Christian Brauner
  2021-08-12  5:36       ` Michael Kerrisk (man-pages)
  0 siblings, 1 reply; 15+ messages in thread
From: Christian Brauner @ 2021-08-11 10:07 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Alejandro Colomar, linux-fsdevel, lkml, linux-man, Christoph Hellwig

On Tue, Aug 10, 2021 at 11:06:52PM +0200, Michael Kerrisk (man-pages) wrote:
> Hello Christian,
> 
> On 8/10/21 4:32 PM, Christian Brauner wrote:
> > On Tue, Aug 10, 2021 at 03:38:00AM +0200, Michael Kerrisk (man-pages) wrote:
> >> Hi Christian,
> >>
> >> Thanks for the very nice manual page that you wrote. I have
> > 
> > Thank you!
> > 
> >> made a large number of (mostly trivial) edits. If you could
> >> read the page closely, to check that I introduced no errors,
> >> I would appreciate it.
> > 
> > Happy to!
> 
> Thanks for the feedback. I've made some changes, and pushed to Git.
> 
> There's still a few open questions. Please see "????" below.

Sure.

> 
> >> I have various questions below, marked ???. Could you please take
> >> a look at these, and I will then make further edits based on your
> >> answers.
> > 
> > I've answered all questions, I think. Feel free to just reformulate
> > where my suggestions weren't adequate. Since most things you ask about
> > are minor adaptions there's no need from my end for you to resend with
> > those reformulations. You can just make them directly. :) I'll peruse
> > the man-pages git repo anyway after you apply them and will send changes
> > if I spot issues.
> > 
> > Thank you for the review!
> > Christian
> > 
> >>
> >> The current version of the page is already pushed to the man-pages
> >> Git repo.
> >>
> >>>   MOUNT_SETATTR(2)      Linux Programmer's Manual     MOUNT_SETATTR(2)
> >>>
> >>>   NAME
> >>>       mount_setattr - change mount properties of a mount or mount
> >>
> >> ???
> >> s/mount properties/properties ?
> >>
> >> (Just bcause more concise.)
> > 
> > Sounds good.
> 
> Done.
> 
> >>
> >>>       tree
> >>>
> >>>   SYNOPSIS
> >>>       #include <linux/fcntl.h> /* Definition of AT_* constants */
> >>>       #include <linux/mount.h> /* Definition of MOUNT_ATTR_* constants */
> >>>       #include <sys/syscall.h> /* Definition of SYS_* constants */
> >>>       #include <unistd.h>
> >>>
> >>>       int syscall(SYS_mount_setattr, int dirfd, const char *path,
> >>>               unsigned int flags, struct mount_attr *attr, size_t size);
> >>>
> >>>       Note: glibc provides no wrapper for mount_setattr(),
> >>>       necessitating the use of syscall(2).
> >>>
> >>>   DESCRIPTION
> 
> [...]
> 
> >>>       The size argument should usually be specified as
> >>>       sizeof(struct mount_attr).  However, if the caller does not
> >>>       intend to make use of features that got introduced after the
> >>>       initial version of struct mount_attr, it is possible to pass
> >>>       the size of the initial struct together with the larger
> >>>       struct.  This allows the kernel to not copy later parts of
> >>>       the struct that aren't used anyway.  With each extension that
> >>>       changes the size of struct mount_attr, the kernel will expose
> >>>       a definition of the form MOUNT_ATTR_SIZE_VERnumber.  For
> >>>       example, the macro for the size of the initial version of
> >>>       struct mount_attr is MOUNT_ATTR_SIZE_VER0.
> >>
> >> ???
> >> I think I understand the above paragraph, but I wonder if it could
> >> be improved a little. The general principle is that one can always
> >> pass the size of an earlier, smaller structure to the kernel, right?
> > 
> > Yes.
> > 
> >> My point is that it need not be the size of the initial structure,
> >> right? So, I wonder whether a little rewording might be need above.
> > 
> > Yes, the initial structure size is just an example because that will be
> > very common.
> > 
> >> What do you think?
> > 
> > Sure, I'm proposing something here but please, fell free to reformulate
> > or come up with something completely new:
> > 
> > 	[...]
> > 	However, if the caller is using a kernel that supports an
> > 	extended struct mount_attr but the caller does not intend to
> > 	make use of these features they can pass the size of an earlier
> > 	version of the struct together with the extended structure.
> > 	[...]
> 
> Perfect! I took that text pretty much exactly as you gave it.
> 
> [...]
> 
> >>>       The attr_set and attr_clr members are used to specify the
> >>>       mount properties that are supposed to be set or cleared for a
> >>>       mount or mount tree.  Flags set in attr_set enable a property
> >>>       on a mount or mount tree, and flags set in attr_clr remove a
> >>>       property from a mount or mount tree.
> >>>
> >>>       When changing mount properties, the kernel will first clear
> >>>       the flags specified in the attr_clr field, and then set the
> >>>       flags specified in the attr_set field:
> >>
> >> ???
> >> I find the following example a bit confusing. See below.
> >>
> >>>
> >>>           struct mount_attr attr = {
> >>>               .attr_clr = MOUNT_ATTR_NOEXEC | MOUNT_ATTR_NODEV,
> >>>               .attr_set = MOUNT_ATTR_RDONLY | MOUNT_ATTR_NOSUID,
> >>>           };
> >>
> >> ???
> >> I *think* that what you are trying to show is that the above initializer
> >> resuts in the equivalent of the following code. Is that correct? If so, 
> >> I think the text needs some work to make this clearer. Let me know.
> > 
> > Yes, exactly. Feel free to remove that code and just explain it in text
> > if that's better.
> 
> I've done some rewording to say that the code snippet shows
> the effect of the initializer.
> 
> [...]
> 
> >>>   RETURN VALUE
> >>>       On success, mount_setattr() returns zero.  On error, -1 is
> >>>       returned and errno is set to indicate the cause of the error.
> >>>
> >>>   ERRORS
> 
> [...]
> 
> >>>       EINVAL A valid file descriptor value was specified in
> >>>              userns_fd, but the file descriptor wasn't a namespace
> >>>              file descriptor or did not refer to a user namespace.
> >>
> >> ???
> >> Could the above not be simplified to
> >>
> >>       EINVAL A valid file descriptor value was specified in
> >>              userns_fd, but the file descriptor did not refer
> >>              to a user namespace.
> > 
> > Sounds good.
> > 
> >> ?
> 
> Done.
> 
> >>>
> >>>       EINVAL The underlying filesystem does not support ID-mapped
> >>>              mounts.
> >>>
> >>>       EINVAL The mount that is to be ID mapped is not a
> >>>              detached/anonymous mount; that is, the mount is
> >>
> >> ???
> >> What is a the distinction between "detached" and "anonymous"?
> >> Or do you mean them to be synonymous? If so, then let's use
> >> just one term, and I think "detached" is preferable.
> > 
> > Yes, they are synonymous here. I list both because detached can
> > potentially be confusing. A detached mount is a mount that has not been
> > visible in the filesystem. But if you attached it an then unmount it
> > right after and keep the fd for the mountpoint open it's a detached
> > mount purely on a natural language level, I'd argue. But it's not a
> > detached mount from the kernel's view anymore because it has been
> > exposed in the filesystem and is thus not detached anymore.
> > But I do prefer "detached" to "anonymous" and that confusion is very
> > unlikely to occur.
> 
> Thanks. I made it "detached". Elsewhere, the page already explains
> that a detached mount is one that:
> 
>           must have been created by calling open_tree(2) with the
>           OPEN_TREE_CLONE flag and it must not already have been
>           visible in the filesystem.
> 
> Which seems a fine explanation. 
> 
> ????
> But, just a thought... "visible in the filesystem" seems not quite accurate. 
> What you really mean I guess is that it must not already have been
> /visible in the filesystem hierarchy/previously mounted/something else/,
> right?

A detached mount is created via the OPEN_TREE_CLONE flag. It is a
separate new mount so "previously mounted" is not applicable.
A detached mount is _related_ to what the MS_BIND flag gives you with
mount(2). However, they differ conceptually and technically. A MS_BIND
mount(2) is always visible in the fileystem when mount(2) returns, i.e.
it is discoverable by regular path-lookup starting within the
filesystem.

However, a detached mount can be seen as a split of MS_BIND into two
distinct steps:
1. fd_tree = open_tree(OPEN_TREE_CLONE): create a new mount
2. move_mount(fd_tree, <somewhere>):     attach the mount to the filesystem

1. and 2. together give you the equivalent of MS_BIND.
In between 1. and 2. however the mount is detached. For the kernel
"detached" means that an anonymous mount namespace is attached to it
which doen't appear in proc and has a 0 sequence number (Technically,
there's a bit of semantical argument to be made that "attached" and
"detached" are ambiguous as they could also be taken to mean "does or
does not have a parent mount". This ambiguity e.g. appears in
do_move_mount(). That's why the kernel itself calls it an "anonymous
mount". However, an OPEN_TREE_CLONE-detached mount of course doesn't
have a parent mount so it works.).

For userspace it's better to think of detached and attached in terms of
visibility in the filesystem or in a mount namespace. That's more
straightfoward, more relevant, and hits the target in 90% of the cases.

However, the better and clearer picture is to say that a
OPEN_TREE_CLONE-detached mount is a mount that has never been
move_mount()ed. Which in turn can be defined as the detached mount has
never been made visible in a mount namespace. Once that has happened the
mount is irreversibly an attached mount.

I keep thinking that maybe we should just say "anonymous mount"
everywhere. So changing the wording to:

[...]
EINVAL The mount that is to be ID mapped is not an anonymous mount; that is, the mount has already been visible in a mount namespace.
[...]

[...]
The mount must be an anonymous mount; that is, it must have been created by calling open_tree(2) with the OPEN_TREE_CLONE flag and it must not already have been visible in a mount namespace, i.e. it must not have been attached to the filesystem hierarchy with syscalls such as move_mount() syscall.
[...]

(I'm using the formulation "with syscalls such as move_mount()" to
future proof this. :)).

> 
> >>>              already visible in the filesystem.
> >>>
> 
> [...]
> 
> >>>       EPERM  An already ID-mapped mount was supposed to be ID
> >>>              mapped.
> >>
> >> ???
> >> Better:
> >>     An attempt was made to add an ID mapping to a mount that is already
> >>     ID mapped.
> > 
> > Sounds good.
> > 
> >> ?
> 
> Done.
> 
> [...]
> 
> >>>   NOTES
> >>>   ID-mapped mounts
> >>>       Creating an ID-mapped mount makes it possible to change the
> >>>       ownership of all files located under a mount.  Thus, ID-
> >>>       mapped mounts make it possible to change ownership in a
> >>>       temporary and localized way.  It is a localized change
> >>>       because ownership changes are restricted to a specific mount.
> >>
> >> ???
> >> Would it be clearer to say something like:
> >>
> >>     It is a localized change because ownership changes are
> >>     visible only via a specific mount.
> >> ?
> > 
> > Sounds good.
> 
> Done.
> 
> [...]
> 
> >>>       The following conditions must be met in order to create an
> >>>       ID-mapped mount:
> >>>
> >>>       •  The caller must have the CAP_SYS_ADMIN capability in the
> >>>          initial user namespace.
> >>>
> >>>       •  The filesystem must be mounted in the initial user
> >>>          namespace.
> >>
> >> ???
> >> Should this rather be written as:
> >>  
> >>      The filesystem must be mounted in a mount namespace 
> >>      that is owned by the initial user namespace.
> > 
> > Sounds good.
> 
> Done.
> 
> >>>       •  The underlying filesystem must support ID-mapped mounts.
> >>>          Currently, the xfs(5), ext4(5), and FAT filesystems
> >>>          support ID-mapped mounts with more filesystems being
> >>>          actively worked on.
> >>>
> >>>       •  The mount must not already be ID-mapped.  This also
> >>>          implies that the ID mapping of a mount cannot be altered.
> >>>
> >>>       •  The mount must be a detached/anonymous mount; that is, it
> >>
> >> ???
> >> See the above questionon "detached" vs "anonymous"
> > 
> > Yes, please use "detached" only.
> 
> Done.
> 
> >>>          must have been created by calling open_tree(2) with the
> >>>          OPEN_TREE_CLONE flag and it must not already have been
> >>>          visible in the filesystem.
> >>>
> >>>       ID mappings can be created for user IDs, group IDs, and
> >>>       project IDs.  An ID mapping is essentially a mapping of a
> >>>       range of user or group IDs into another or the same range of
> >>>       user or group IDs.  ID mappings are usually written as three
> >>>       numbers either separated by white space or a full stop.  The
> >>>       first two numbers specify the starting user or group ID in
> >>>       each of the two user namespaces.  The third number specifies
> >>>       the range of the ID mapping.  For example, a mapping for user
> >>>       IDs such as 1000:1001:1 would indicate that user ID 1000 in
> >>>       the caller's user namespace is mapped to user ID 1001 in its
> >>>       ancestor user namespace.  Since the map range is 1, only user
> >>>       ID 1000 is mapped.
> >>
> >> ???
> >> The details above seem wrong. When writing to map files, the
> >> fields must be white-space separated, AFAIK. But above you mention
> >> "full stops" and also show an example using colons (:). Those
> >> both seem wrong and confusing. Am I missing something?
> > 
> > This is more about notational conventions that exist and not about how
> > they are actually written. That's something I'm not touching on here as
> > it doesn't belong on this manpage. But feel free to only mention spaces.
> 
> Thanks for the explanation. In this context though, this could mislead
> the reader, so I've removed mention of "full stop" and ":".
> 
> >>>       It is possible to specify up to 340 ID mappings for each ID
> >>>       mapping type.  If any user IDs or group IDs are not mapped,
> >>>       all files owned by that unmapped user or group ID will appear
> >>>       as being owned by the overflow user ID or overflow group ID
> >>>       respectively.
> >>>
> >>>       Further details and instructions for setting up ID mappings
> >>>       can be found in the user_namespaces(7) man page.
> >>>
> >>>       In the common case, the user namespace passed in userns_fd
> >>>       together with MOUNT_ATTR_IDMAP in attr_set to create an ID-
> >>>       mapped mount will be the user namespace of a container.  In
> >>>       other scenarios it will be a dedicated user namespace
> >>>       associated with a user's login session as is the case for
> >>>       portable home directories in systemd-homed.service(8)).  It
> >>>       is also perfectly fine to create a dedicated user namespace
> >>>       for the sake of ID mapping a mount.
> 
> I forgot to mention it earlier, but the following text on the
> rationale for ID-mapped mounts is what turns this from a good 
> manual page into a great manual page. Thank you for including it.

Thank you for saying that. Appreciate it.

> 
> >>>       ID-mapped mounts can be useful in the following and a variety
> >>>       of other scenarios:
> >>>
> >>>       •  Sharing files between multiple users or multiple machines,
> >>
> >> ???
> >> s/Sharing files/Sharing filesystems/ ?
> > 
> > [1]: But work. But feel free to use "sharing filesystems".
> 
> s/But/Both/
> 
> I made it "Sharing files or filesystsms"
> 
> >>
> >>>          especially in complex scenarios.  For example, ID-mapped
> >>>          mounts are used to implement portable home directories in
> >>>          systemd-homed.service(8), where they allow users to move
> >>>          their home directory to an external storage device and use
> >>>          it on multiple computers where they are assigned different
> >>>          user IDs and group IDs.  This effectively makes it
> >>>          possible to assign random user IDs and group IDs at login
> >>>          time.
> >>>
> >>>       •  Sharing files from the host with unprivileged containers.
> >>
> >> ???
> >> s/Sharing files/Sharing filesystems/ ?
> > 
> > See [1].
> 
> Same.
> 
> >>>          This allows a user to avoid having to change ownership
> >>>          permanently through chown(2).
> >>>
> >>>       •  ID mapping a container's root filesystem.  Users don't
> >>>          need to change ownership permanently through chown(2).
> >>>          Especially for large root filesystems, using chown(2) can
> >>>          be prohibitively expensive.
> >>>
> >>>       •  Sharing files between containers with non-overlapping ID
> >>
> >> ???
> >> s/Sharing files/Sharing filesystems/ ?
> > 
> > See [1].
> 
> Same.
> 
> [...]
> 
> >>>       •  Locally and temporarily restricted ownership changes.  ID-
> >>>          mapped mounts make it possible to change ownership
> >>>          locally, restricting it to specific mounts, and
> >>
> >> ???
> >> The referent of "it" in the preceding line is not clear.
> >> Should it be "the ownership changes"? Or something else?
> > 
> > It should refer to ownership changes. I'd appreciate it if you could
> > reformulate.
> 
> Done.
> 
> >>>          temporarily as the ownership changes only apply as long as
> >>>          the mount exists.  By contrast, changing ownership via the
> >>>          chown(2) system call changes the ownership globally and
> >>>          permanently.
> >>>
> >>>   Extensibility
> 
> [...]
> 
> >>>   EXAMPLES
> >>
> >> ???
> >> Do you have a (preferably simple) example piece of code
> >> somewhere for setting up an ID mapped mount?
> 
> ????
> I guess the best example is this:
> https://github.com/brauner/mount-idmapped/
> right?

Ah yes, sorry. I forgot to answer that yesterday. I sent you links via
another medium but I repeat it here.
There are two places. The link you have here is a private repo. But I've
also merged a program alongside the fstests testsuite I merged:
https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/tree/src/idmapped-mounts/mount-idmapped.c
which should be nicer and has seen reviews by Amir and Christoph.

> 
> [...]
> 
> >>>       int
> >>>       main(int argc, char *argv[])
> >>>       {
> >>>           struct mount_attr *attr = &(struct mount_attr){};
> >>>           int fd_userns = -EBADF;
> >>
> >> ???
> >> Why this magic initializer here? Why not just "-1"?
> >> Using -EBADF makes it look this is value specifically is
> >> meaningful, although I don't think that's true.
> > 
> > [2]: I always use -EBADF to initialize fds in all my code. It makes it
> > pretty easy to grep for fd initialization etc. So it's pure visual
> > convenience. Freel free to just use -1.
> 
> Changed.
> 
> [...]
> 
> >>>           int fd_tree = open_tree(-EBADF, source,
> >>>                        OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC |
> >>>                        AT_EMPTY_PATH | (recursive ? AT_RECURSIVE : 0));
> >>
> >> ???
> >> What is the significance of -EBADF here? As far as I can tell, it
> >> is not meaningful to open_tree()?
> > 
> > I always pass -EBADF for similar reasons to [2]. Feel free to just use -1.
> 
> ????
> But here, both -EBADF and -1 seem to be wrong. This argument 
> is a dirfd, and so should either be a file descriptor or the
> value AT_FDCWD, right?

[1]: In this code "source" is expected to be absolute. If it's not
     absolute we should fail. This can be achieved by passing -1/-EBADF,
     afaict.

> 
> >>>           if (fd_tree == -1)
> >>>               exit_log("%m - Failed to open %s\n", source);
> >>>
> >>>           if (fd_userns >= 0) {
> >>>               attr->attr_set  |= MOUNT_ATTR_IDMAP;
> >>>               attr->userns_fd = fd_userns;
> >>>           }
> >>>
> >>>           ret = mount_setattr(fd_tree, "",
> >>>                       AT_EMPTY_PATH | (recursive ? AT_RECURSIVE : 0),
> >>>                       attr, sizeof(struct mount_attr));
> >>>           if (ret == -1)
> >>>               exit_log("%m - Failed to change mount attributes\n");
> >>>
> >>>           close(fd_userns);
> >>>
> >>>           ret = move_mount(fd_tree, "", -EBADF, target,
> >>>                            MOVE_MOUNT_F_EMPTY_PATH);
> >>
> >> ???
> >> What is the significance of -EBADF here? As far as I can tell, it
> >> is not meaningful to move_mount()?
> > 
> > See [2].
> 
> ????
> As above, both -EBADF and -1 seem to be wrong. This argument 
> is a dirfd, and so should either be a file descriptor or the
> value AT_FDCWD, right?

See [1].

Thanks!
Christian

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Questions re the new mount_setattr(2) manual page
  2021-08-10 22:47 ` Michael Kerrisk (man-pages)
@ 2021-08-11 10:40   ` Christian Brauner
  2021-08-12  5:36     ` Michael Kerrisk (man-pages)
  0 siblings, 1 reply; 15+ messages in thread
From: Christian Brauner @ 2021-08-11 10:40 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Alejandro Colomar, linux-fsdevel, lkml, linux-man, Christoph Hellwig

On Wed, Aug 11, 2021 at 12:47:14AM +0200, Michael Kerrisk (man-pages) wrote:
> Hi Christian,
> 
> Some further questions...
> 
> In ERRORS there is:
> 
>        EINVAL The underlying filesystem is mounted in a user namespace.
> 
> I don't understand this. What does it mean?

The underlying filesystem has been mounted in a mount namespace that is
owned by a non-initial user namespace (Think of sysfs, overlayfs etc.).

> 
> Also, there is this:
> 
>        ENOMEM When  changing  mount  propagation to MS_SHARED, a new peer
>               group ID needs to be allocated for  all  mounts  without  a
>               peer  group  ID  set.  Allocation of this peer group ID has
>               failed.
> 
>        ENOSPC When changing mount propagation to MS_SHARED,  a  new  peer
>               group  ID  needs  to  be allocated for all mounts without a
>               peer group ID set.  Allocation of this peer  group  ID  can
>               fail.  Note that technically further error codes are possi‐
>               ble that are specific to the ID  allocation  implementation
>               used.
> 
> What is the difference between these two error cases? (That is, in what 
> circumstances will one get ENOMEM vs ENOSPC and vice versa?)

I did really wonder whether to even include those errors and I regret
having included them because they aren't worth a detailed discussion as
I'd consider them kernel internal relevant errors rather than userspace
relevant errors. In essence, peer group ids are allocated using the id
infrastructure of the kernel. It can fail for two main reasons:

1. ENOMEM there's not enough memory to allocate the relevant internal
   structures needed for the bitmap.
2. ENOSPC we ran out of ids, i.e. someone has somehow managed to
   allocate so many peer groups and managed to keep the kernel running
   (???) that the ida has ran out of ids.

Feel free to just drop those errors.

> 
> And then:
> 
>        EPERM  One  of  the mounts had at least one of MOUNT_ATTR_NOATIME,
>               MOUNT_ATTR_NODEV, MOUNT_ATTR_NODIRATIME, MOUNT_ATTR_NOEXEC,
>               MOUNT_ATTR_NOSUID, or MOUNT_ATTR_RDONLY set and the flag is
>               locked.  Mount attributes become locked on a mount if:
> 
>               •  A new mount or mount tree is created causing mount prop‐
>                  agation  across  user  namespaces.  The kernel will lock
> 
> Propagation is done across mont points, not user namespaces.
> should "across user namespaces" be "to a mount namespace owned 
> by a different user namespace"? Or something else?

That's really splitting hairs. Of course this means that we're
propagating into a mount namespace that is owned by a different user
namespace though "crossing user namespaces" might have been the better
choice.

> 
>                  the aforementioned  flags  to  protect  these  sensitive
>                  properties from being altered.
> 
>               •  A  new  mount  and user namespace pair is created.  This
>                  happens for  example  when  specifying  CLONE_NEWUSER  |
>                  CLONE_NEWNS  in unshare(2), clone(2), or clone3(2).  The
>                  aforementioned flags become locked to protect user name‐
>                  spaces from altering sensitive mount properties.
> 
> Again, this seems imprecise. Should it say something like:
> "... to prevent changes to sensitive mount properties in the new 
> mount namespace" ? Or perhaps you have a better wording.

That's not imprecise. What you want to protect against is altering
sensitive mount properties from within a user namespace irrespective of
whether or not the user namespace actually owns the mount namespace,
i.e. even if you own the mount namespace you shouldn't be able to alter
those properties. I concede though that "protect" should've been
"prevent".

You could probably say:

	A  new  mount  and user namespace pair is created.  This
	happens for  example  when  specifying  CLONE_NEWUSER  |
	CLONE_NEWNS  in unshare(2), clone(2), or clone3(2).
	The aforementioned flags become locked in the new mount
	namespace to prevent sensitive mount properties from being
	altered.
	Since the newly created mount namespace will be owned by the
	newly created user namespace a caller privileged in the newly
	created user namespace would be able to alter senstive
	mount properties. For example, without locking the read-only
	property for the mounts in the new mount namespace such a caller
	would be able to remount them read-write.

(Fwiw, in this scenario there's a bit of (moderately sane) strangeness.
 A CLONE_NEWUSER | CLONE_NEWMNT will cause even stronger protection to
 kick in. For all mounts not marked as expired MNT_LOCKED will be set
 which means that a umount() on any such mount copied from the previous
 mount namespace will yield EINVAL implying from userspace' perspective
 it's not mounted - granted EINVAL is the ioctl() of multiplexing errnos
 - whereas a remount to alter a locked flag will yield EPERM.)

Christian

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Questions re the new mount_setattr(2) manual page
  2021-08-11 10:07     ` Christian Brauner
@ 2021-08-12  5:36       ` Michael Kerrisk (man-pages)
  2021-08-12  9:08         ` Christian Brauner
  0 siblings, 1 reply; 15+ messages in thread
From: Michael Kerrisk (man-pages) @ 2021-08-12  5:36 UTC (permalink / raw)
  To: Christian Brauner
  Cc: mtk.manpages, Alejandro Colomar, linux-fsdevel, lkml, linux-man,
	Christoph Hellwig

Hello Christian,

Thanks for the answers.

A couple of small queries still below.

On 8/11/21 12:07 PM, Christian Brauner wrote:
> On Tue, Aug 10, 2021 at 11:06:52PM +0200, Michael Kerrisk (man-pages) wrote:

[...]

>>>>>       EINVAL The mount that is to be ID mapped is not a
>>>>>              detached/anonymous mount; that is, the mount is
>>>>
>>>> ???
>>>> What is a the distinction between "detached" and "anonymous"?
>>>> Or do you mean them to be synonymous? If so, then let's use
>>>> just one term, and I think "detached" is preferable.
>>>
>>> Yes, they are synonymous here. I list both because detached can
>>> potentially be confusing. A detached mount is a mount that has not been
>>> visible in the filesystem. But if you attached it an then unmount it
>>> right after and keep the fd for the mountpoint open it's a detached
>>> mount purely on a natural language level, I'd argue. But it's not a
>>> detached mount from the kernel's view anymore because it has been
>>> exposed in the filesystem and is thus not detached anymore.
>>> But I do prefer "detached" to "anonymous" and that confusion is very
>>> unlikely to occur.
>>
>> Thanks. I made it "detached". Elsewhere, the page already explains
>> that a detached mount is one that:
>>
>>           must have been created by calling open_tree(2) with the
>>           OPEN_TREE_CLONE flag and it must not already have been
>>           visible in the filesystem.
>>
>> Which seems a fine explanation. 
>>
>> ????
>> But, just a thought... "visible in the filesystem" seems not quite accurate. 
>> What you really mean I guess is that it must not already have been
>> /visible in the filesystem hierarchy/previously mounted/something else/,
>> right?

I suppose that I should have clarified that my main problem was
that you were using the word "filesystem" in a way that I find
unconventional/ambiguous. I mean, I normally take the term
"filesystem" to be "a storage system for folding files".
Here, you are using "filesystem" to mean something else, what 
I might call like "the single directory hierarchy" or "the
filesystem hierarchy" or "the list of mount points".

> A detached mount is created via the OPEN_TREE_CLONE flag. It is a
> separate new mount so "previously mounted" is not applicable.
> A detached mount is _related_ to what the MS_BIND flag gives you with
> mount(2). However, they differ conceptually and technically. A MS_BIND
> mount(2) is always visible in the fileystem when mount(2) returns, i.e.
> it is discoverable by regular path-lookup starting within the
> filesystem.
> 
> However, a detached mount can be seen as a split of MS_BIND into two
> distinct steps:
> 1. fd_tree = open_tree(OPEN_TREE_CLONE): create a new mount
> 2. move_mount(fd_tree, <somewhere>):     attach the mount to the filesystem
> 
> 1. and 2. together give you the equivalent of MS_BIND.
> In between 1. and 2. however the mount is detached. For the kernel
> "detached" means that an anonymous mount namespace is attached to it
> which doen't appear in proc and has a 0 sequence number (Technically,
> there's a bit of semantical argument to be made that "attached" and
> "detached" are ambiguous as they could also be taken to mean "does or
> does not have a parent mount". This ambiguity e.g. appears in
> do_move_mount(). That's why the kernel itself calls it an "anonymous
> mount". However, an OPEN_TREE_CLONE-detached mount of course doesn't
> have a parent mount so it works.).
> 
> For userspace it's better to think of detached and attached in terms of
> visibility in the filesystem or in a mount namespace. That's more
> straightfoward, more relevant, and hits the target in 90% of the cases.
> 
> However, the better and clearer picture is to say that a
> OPEN_TREE_CLONE-detached mount is a mount that has never been
> move_mount()ed. Which in turn can be defined as the detached mount has
> never been made visible in a mount namespace. Once that has happened the
> mount is irreversibly an attached mount.
> 
> I keep thinking that maybe we should just say "anonymous mount"
> everywhere. So changing the wording to:

I'm not against the word "detached". To user space, I think it is a
little more meaningful than "anonymous". For the moment, I'll stay with
"detached", but if you insist on "anonymous", I'll probably change it.

> [...]
> EINVAL The mount that is to be ID mapped is not an anonymous mount;
> that is, the mount has already been visible in a mount namespace.

I like that text *a lot* better! Thanks very much for suggesting
wordings. It makes my life much easier. 

I've made the text:

       EINVAL The mount that is to be ID mapped is not a detached
              mount; that is, the mount has not previously been
              visible in a mount namespace.

> [...]
> The mount must be an anonymous mount; that is, it must have been
> created by calling open_tree(2) with the OPEN_TREE_CLONE flag and it
> must not already have been visible in a mount namespace, i.e. it must
> not have been attached to the filesystem hierarchy with syscalls such
> as move_mount() syscall.

And that too! I've made the text:

       •  The mount must be a detached mount; that is, it must have
          been created by calling open_tree(2) with the
          OPEN_TREE_CLONE flag and it must not already have been
          visible in a mount namespace.  (To put things another way:
          the mount must not have been attached to the filesystem
          hierarchy with a system call such as move_mount(2).)

> [...]
> 
> (I'm using the formulation "with syscalls such as move_mount()" to
> future proof this. :)).

Fair enough.

>>>>>   EXAMPLES
>>>>
>>>> ???
>>>> Do you have a (preferably simple) example piece of code
>>>> somewhere for setting up an ID mapped mount?
>>
>> ????
>> I guess the best example is this:
>> https://github.com/brauner/mount-idmapped/
>> right?
> 
> Ah yes, sorry. I forgot to answer that yesterday. I sent you links via
> another medium but I repeat it here.
> There are two places. The link you have here is a private repo. But I've
> also merged a program alongside the fstests testsuite I merged:
> https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/tree/src/idmapped-mounts/mount-idmapped.c
> which should be nicer and has seen reviews by Amir and Christoph.

Thanks.

[...]

>>>>>           int fd_tree = open_tree(-EBADF, source,
>>>>>                        OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC |
>>>>>                        AT_EMPTY_PATH | (recursive ? AT_RECURSIVE : 0));
>>>>
>>>> ???
>>>> What is the significance of -EBADF here? As far as I can tell, it
>>>> is not meaningful to open_tree()?
>>>
>>> I always pass -EBADF for similar reasons to [2]. Feel free to just use -1.
>>
>> ????
>> But here, both -EBADF and -1 seem to be wrong. This argument 
>> is a dirfd, and so should either be a file descriptor or the
>> value AT_FDCWD, right?
> 
> [1]: In this code "source" is expected to be absolute. If it's not
>      absolute we should fail. This can be achieved by passing -1/-EBADF,
>      afaict.

D'oh! Okay. I hadn't considered that use case for an invalid dirfd.
(And now I've done some adjustments to openat(2),which contains a
rationale for the *at() functions.)

So, now I understand your purpose, but still the code is obscure,
since

* You use a magic value (-EBADF) rather than (say) -1.
* There's no explanation (comment about) of the fact that you want
  to prevent relative pathnames.

So, I've changed the code to use -1, not -EBADF, and I've added some
comments to explain that the intent is to prevent relative pathnames.
Okay?

But, there is still the meta question: what's the problem with using
a relative pathname?

[...]

>>>>>           ret = move_mount(fd_tree, "", -EBADF, target,
>>>>>                            MOVE_MOUNT_F_EMPTY_PATH);
>>>>
>>>> ???
>>>> What is the significance of -EBADF here? As far as I can tell, it
>>>> is not meaningful to move_mount()?
>>>
>>> See [2].
>>
>> ????
>> As above, both -EBADF and -1 seem to be wrong. This argument 
>> is a dirfd, and so should either be a file descriptor or the
>> value AT_FDCWD, right?
> 
> See [1].

I made the same change as above.

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Questions re the new mount_setattr(2) manual page
  2021-08-11 10:40   ` Christian Brauner
@ 2021-08-12  5:36     ` Michael Kerrisk (man-pages)
  2021-08-12  8:38       ` Christian Brauner
  0 siblings, 1 reply; 15+ messages in thread
From: Michael Kerrisk (man-pages) @ 2021-08-12  5:36 UTC (permalink / raw)
  To: Christian Brauner
  Cc: mtk.manpages, Alejandro Colomar, linux-fsdevel, lkml, linux-man,
	Christoph Hellwig, Eric W. Biederman

[CC += Eric, in case he has a comment on the last piece]

Hi Christian,

(A few questions below.)

On 8/11/21 12:40 PM, Christian Brauner wrote:
> On Wed, Aug 11, 2021 at 12:47:14AM +0200, Michael Kerrisk (man-pages) wrote:
>> Hi Christian,
>>
>> Some further questions...
>>
>> In ERRORS there is:
>>
>>        EINVAL The underlying filesystem is mounted in a user namespace.
>>
>> I don't understand this. What does it mean?
> 
> The underlying filesystem has been mounted in a mount namespace that is
> owned by a non-initial user namespace (Think of sysfs, overlayfs etc.).

Thanks!

>> Also, there is this:
>>
>>        ENOMEM When  changing  mount  propagation to MS_SHARED, a new peer
>>               group ID needs to be allocated for  all  mounts  without  a
>>               peer  group  ID  set.  Allocation of this peer group ID has
>>               failed.
>>
>>        ENOSPC When changing mount propagation to MS_SHARED,  a  new  peer
>>               group  ID  needs  to  be allocated for all mounts without a
>>               peer group ID set.  Allocation of this peer  group  ID  can
>>               fail.  Note that technically further error codes are possi‐
>>               ble that are specific to the ID  allocation  implementation
>>               used.
>>
>> What is the difference between these two error cases? (That is, in what 
>> circumstances will one get ENOMEM vs ENOSPC and vice versa?)
> 
> I did really wonder whether to even include those errors and I regret
> having included them because they aren't worth a detailed discussion as
> I'd consider them kernel internal relevant errors rather than userspace
> relevant errors. In essence, peer group ids are allocated using the id
> infrastructure of the kernel. It can fail for two main reasons:
> 
> 1. ENOMEM there's not enough memory to allocate the relevant internal
>    structures needed for the bitmap.
> 2. ENOSPC we ran out of ids, i.e. someone has somehow managed to
>    allocate so many peer groups and managed to keep the kernel running
>    (???) that the ida has ran out of ids.
> 
> Feel free to just drop those errors.

Because they can at least theoretically be visible to user space, I
prefer to keep them. But I've reworked a bit:

       ENOMEM When changing mount propagation to MS_SHARED, a new
              peer group ID needs to be allocated for all mounts
              without a peer group ID set.  This allocation failed
              because there was not enough memory to allocate the
              relevant internal structures.

       ENOSPC When changing mount propagation to MS_SHARED, a new
              peer group ID needs to be allocated for all mounts
              without a peer group ID set.  This allocation failed
              because the kernel has run out of IDs.

>> And then:
>>
>>        EPERM  One  of  the mounts had at least one of MOUNT_ATTR_NOATIME,
>>               MOUNT_ATTR_NODEV, MOUNT_ATTR_NODIRATIME, MOUNT_ATTR_NOEXEC,
>>               MOUNT_ATTR_NOSUID, or MOUNT_ATTR_RDONLY set and the flag is
>>               locked.  Mount attributes become locked on a mount if:
>>
>>               •  A new mount or mount tree is created causing mount prop‐
>>                  agation  across  user  namespaces.  The kernel will lock
>>
>> Propagation is done across mont points, not user namespaces.
>> should "across user namespaces" be "to a mount namespace owned 
>> by a different user namespace"? Or something else?
> 
> That's really splitting hairs.

To be clear, I'm not trying to split hairs :-). It's just that
I'm struggling a little to understand. (In particular, the notion
of locked mounts is one where my understanding is weak.) 

And think of it like this: I am the first line of defense for the
user-space reader. If I am having trouble to understand the text,
I wont be alone. And often, the problem is not so much that the
text is "wrong", it's that there's a difference in background
knowledge between what you know and what the reader (in this case
me) knows. Part of my task is to fill that gap, by adding info
that I think is necessary to the page (with the happy side
effect that I learn along the way.)

> Of course this means that we're
> propagating into a mount namespace that is owned by a different user
> namespace though "crossing user namespaces" might have been the better
> choice.

This is a perfect example of the point I make above. You say "of course",
but I don't have the background knowledge that you do :-). From my
perspective, I want to make sure that I understand your meaning, so
that that meaning can (IMHO) be made easier for the average reader
of the manual page.

>>                  the aforementioned  flags  to  protect  these  sensitive
>>                  properties from being altered.
>>
>>               •  A  new  mount  and user namespace pair is created.  This
>>                  happens for  example  when  specifying  CLONE_NEWUSER  |
>>                  CLONE_NEWNS  in unshare(2), clone(2), or clone3(2).  The
>>                  aforementioned flags become locked to protect user name‐
>>                  spaces from altering sensitive mount properties.
>>
>> Again, this seems imprecise. Should it say something like:
>> "... to prevent changes to sensitive mount properties in the new 
>> mount namespace" ? Or perhaps you have a better wording.
> 
> That's not imprecise. 

Okay -- poor choice of wording on my part:

s/this seems imprecise/I'm having trouble understanding this/

> What you want to protect against is altering
> sensitive mount properties from within a user namespace irrespective of
> whether or not the user namespace actually owns the mount namespace,
> i.e. even if you own the mount namespace you shouldn't be able to alter
> those properties. I concede though that "protect" should've been
> "prevent".

Can I check my education here please. The point is this:

* The mount point was created in a mount NS that was owned by
  a more privileged user NS (e.g., the initial user NS).
* A CLONE_NEWUSER|CLONE_NEWNS step occurs to create a new (user and) 
  mount NS.
* In the new mount NS, the mounts become locked.

And, help me here: is it correct that the reason the properties
need to be locked is because they are shared between the mounts?

> You could probably say:
> 
> 	A  new  mount  and user namespace pair is created.  This
> 	happens for  example  when  specifying  CLONE_NEWUSER  |
> 	CLONE_NEWNS  in unshare(2), clone(2), or clone3(2).
> 	The aforementioned flags become locked in the new mount
> 	namespace to prevent sensitive mount properties from being
> 	altered.
> 	Since the newly created mount namespace will be owned by the
> 	newly created user namespace a caller privileged in the newly
> 	created user namespace would be able to alter senstive
> 	mount properties. For example, without locking the read-only
> 	property for the mounts in the new mount namespace such a caller
> 	would be able to remount them read-write.

So, I've now made the text:

       EPERM  One of the mounts had at least one of MOUNT_ATTR_NOATIME,
              MOUNT_ATTR_NODEV, MOUNT_ATTR_NODIRATIME, MOUNT_ATTR_NOEXEC,
              MOUNT_ATTR_NOSUID, or MOUNT_ATTR_RDONLY set and the flag is
              locked.  Mount attributes become locked on a mount if:

              •  A new mount or mount tree is created causing mount
                 propagation across user namespaces (i.e., propagation to
                 a mount namespace owned by a different user namespace).
                 The kernel will lock the aforementioned flags to prevent
                 these sensitive properties from being altered.

              •  A new mount and user namespace pair is created.  This
                 happens for example when specifying CLONE_NEWUSER |
                 CLONE_NEWNS in unshare(2), clone(2), or clone3(2).  The
                 aforementioned flags become locked in the new mount
                 namespace to prevent sensitive mount properties from
                 being altered.  Since the newly created mount namespace
                 will be owned by the newly created user namespace, a
                 calling process that is privileged in the new user
                 namespace would—in the absence of such locking—be able
                 to alter senstive mount properties (e.g., to remount a
                 mount that was marked read-only as read-write in the new
                 mount namespace).

Okay?

> (Fwiw, in this scenario there's a bit of (moderately sane) strangeness.
>  A CLONE_NEWUSER | CLONE_NEWMNT will cause even stronger protection to
>  kick in. For all mounts not marked as expired MNT_LOCKED will be set
>  which means that a umount() on any such mount copied from the previous
>  mount namespace will yield EINVAL implying from userspace' perspective
>  it's not mounted - granted EINVAL is the ioctl() of multiplexing errnos
>  - whereas a remount to alter a locked flag will yield EPERM.)

Thanks for educating me! So, is that what we are seeing below?

$ sudo umount /mnt/m1
$ sudo mount -t tmpfs none /mnt/m1
$ sudo unshare -pf -Ur -m --mount-proc strace -o /tmp/log umount /mnt/m1
umount: /mnt/m1: not mounted.
$ grep ^umount /tmp/log
umount2("/mnt/m1", 0)                   = -1 EINVAL (Invalid argument)

The mount_namespaces(7) page has for a log time had this text:

       *  Mounts that come as a single unit from a more privileged mount
          namespace are locked together and may not be separated in a
          less privileged mount namespace.  (The unshare(2) CLONE_NEWNS
          operation brings across all of the mounts from the original
          mount namespace as a single unit, and recursive mounts that
          propagate between mount namespaces propagate as a single unit.)

I have had trouble understanding that. But maybe you just helped.
Is that text relevant to what you just wrote above? In particular,
I have trouble understanding what "separated" means. But, perhaps
is means "separately unmounted"? (I added Eric in CC,
in case he has something to say.)

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Questions re the new mount_setattr(2) manual page
  2021-08-12  5:36     ` Michael Kerrisk (man-pages)
@ 2021-08-12  8:38       ` Christian Brauner
  2021-08-13  1:25         ` Michael Kerrisk (man-pages)
  0 siblings, 1 reply; 15+ messages in thread
From: Christian Brauner @ 2021-08-12  8:38 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Alejandro Colomar, linux-fsdevel, lkml, linux-man,
	Christoph Hellwig, Eric W. Biederman

On Thu, Aug 12, 2021 at 07:36:54AM +0200, Michael Kerrisk (man-pages) wrote:
> [CC += Eric, in case he has a comment on the last piece]
> 
> Hi Christian,
> 
> (A few questions below.)
> 
> On 8/11/21 12:40 PM, Christian Brauner wrote:
> > On Wed, Aug 11, 2021 at 12:47:14AM +0200, Michael Kerrisk (man-pages) wrote:
> >> Hi Christian,
> >>
> >> Some further questions...
> >>
> >> In ERRORS there is:
> >>
> >>        EINVAL The underlying filesystem is mounted in a user namespace.
> >>
> >> I don't understand this. What does it mean?
> > 
> > The underlying filesystem has been mounted in a mount namespace that is
> > owned by a non-initial user namespace (Think of sysfs, overlayfs etc.).
> 
> Thanks!
> 
> >> Also, there is this:
> >>
> >>        ENOMEM When  changing  mount  propagation to MS_SHARED, a new peer
> >>               group ID needs to be allocated for  all  mounts  without  a
> >>               peer  group  ID  set.  Allocation of this peer group ID has
> >>               failed.
> >>
> >>        ENOSPC When changing mount propagation to MS_SHARED,  a  new  peer
> >>               group  ID  needs  to  be allocated for all mounts without a
> >>               peer group ID set.  Allocation of this peer  group  ID  can
> >>               fail.  Note that technically further error codes are possi‐
> >>               ble that are specific to the ID  allocation  implementation
> >>               used.
> >>
> >> What is the difference between these two error cases? (That is, in what 
> >> circumstances will one get ENOMEM vs ENOSPC and vice versa?)
> > 
> > I did really wonder whether to even include those errors and I regret
> > having included them because they aren't worth a detailed discussion as
> > I'd consider them kernel internal relevant errors rather than userspace
> > relevant errors. In essence, peer group ids are allocated using the id
> > infrastructure of the kernel. It can fail for two main reasons:
> > 
> > 1. ENOMEM there's not enough memory to allocate the relevant internal
> >    structures needed for the bitmap.
> > 2. ENOSPC we ran out of ids, i.e. someone has somehow managed to
> >    allocate so many peer groups and managed to keep the kernel running
> >    (???) that the ida has ran out of ids.
> > 
> > Feel free to just drop those errors.
> 
> Because they can at least theoretically be visible to user space, I
> prefer to keep them. But I've reworked a bit:
> 
>        ENOMEM When changing mount propagation to MS_SHARED, a new
>               peer group ID needs to be allocated for all mounts
>               without a peer group ID set.  This allocation failed
>               because there was not enough memory to allocate the
>               relevant internal structures.
> 
>        ENOSPC When changing mount propagation to MS_SHARED, a new
>               peer group ID needs to be allocated for all mounts
>               without a peer group ID set.  This allocation failed
>               because the kernel has run out of IDs.
> 
> >> And then:
> >>
> >>        EPERM  One  of  the mounts had at least one of MOUNT_ATTR_NOATIME,
> >>               MOUNT_ATTR_NODEV, MOUNT_ATTR_NODIRATIME, MOUNT_ATTR_NOEXEC,
> >>               MOUNT_ATTR_NOSUID, or MOUNT_ATTR_RDONLY set and the flag is
> >>               locked.  Mount attributes become locked on a mount if:
> >>
> >>               •  A new mount or mount tree is created causing mount prop‐
> >>                  agation  across  user  namespaces.  The kernel will lock
> >>
> >> Propagation is done across mont points, not user namespaces.
> >> should "across user namespaces" be "to a mount namespace owned 
> >> by a different user namespace"? Or something else?
> > 
> > That's really splitting hairs.
> 
> To be clear, I'm not trying to split hairs :-). It's just that
> I'm struggling a little to understand. (In particular, the notion
> of locked mounts is one where my understanding is weak.) 
> 
> And think of it like this: I am the first line of defense for the
> user-space reader. If I am having trouble to understand the text,
> I wont be alone. And often, the problem is not so much that the
> text is "wrong", it's that there's a difference in background
> knowledge between what you know and what the reader (in this case
> me) knows. Part of my task is to fill that gap, by adding info
> that I think is necessary to the page (with the happy side
> effect that I learn along the way.)

All very good points.
I didn't mean to complain btw. Sorry that it seemed that way. :)

> 
> > Of course this means that we're
> > propagating into a mount namespace that is owned by a different user
> > namespace though "crossing user namespaces" might have been the better
> > choice.
> 
> This is a perfect example of the point I make above. You say "of course",
> but I don't have the background knowledge that you do :-). From my
> perspective, I want to make sure that I understand your meaning, so
> that that meaning can (IMHO) be made easier for the average reader
> of the manual page.
> 
> >>                  the aforementioned  flags  to  protect  these  sensitive
> >>                  properties from being altered.
> >>
> >>               •  A  new  mount  and user namespace pair is created.  This
> >>                  happens for  example  when  specifying  CLONE_NEWUSER  |
> >>                  CLONE_NEWNS  in unshare(2), clone(2), or clone3(2).  The
> >>                  aforementioned flags become locked to protect user name‐
> >>                  spaces from altering sensitive mount properties.
> >>
> >> Again, this seems imprecise. Should it say something like:
> >> "... to prevent changes to sensitive mount properties in the new 
> >> mount namespace" ? Or perhaps you have a better wording.
> > 
> > That's not imprecise. 
> 
> Okay -- poor choice of wording on my part:
> 
> s/this seems imprecise/I'm having trouble understanding this/
> 
> > What you want to protect against is altering
> > sensitive mount properties from within a user namespace irrespective of
> > whether or not the user namespace actually owns the mount namespace,
> > i.e. even if you own the mount namespace you shouldn't be able to alter
> > those properties. I concede though that "protect" should've been
> > "prevent".
> 
> Can I check my education here please. The point is this:
> 
> * The mount point was created in a mount NS that was owned by
>   a more privileged user NS (e.g., the initial user NS).
> * A CLONE_NEWUSER|CLONE_NEWNS step occurs to create a new (user and) 
>   mount NS.
> * In the new mount NS, the mounts become locked.
> 
> And, help me here: is it correct that the reason the properties
> need to be locked is because they are shared between the mounts?

Yes, basically.
The new mount namespace contains a copy of all the mounts in the
previous mount namespace. So they are separate mounts which you can best
see when you do unshare --mount --propagation=private. An unmount in the
new mount namespace won't affect the mount in the previous mount
namespace. Which can only nicely work if they are separate mounts.
Propagation relies (among other things) on the fact that mount
namespaces have copies of the mounts.

The copied mounts in the new mount namespace will have inherited all
properties they had at the time when copy_namespaces() and specifically
copy_mnt_ns() was called. Which calls into copy_tree() and ultimately
into the appropriately named clone_mnt(). This is the low-level routine
that is responsible for cloning the mounts including their mount
properties.

Some mount properties such as read-only, nodev, noexec, nosuid, atime -
while arguably not per se security mechanisms - are used for protection
or as security measures in userspace applications. The most obvious one
might be the read-only property. One wouldn't want to expose a set of
files as read-only only for someone else to trivially gain write access
to them. An example of where that could happen is when creating a new
mount namespaces and user namespace pair where the new mount namespace
is owned by the new user namespace in which the caller is privileged and
thus the caller would also able to alter the new mount namespace. So
without locking flags all it would take to turn a read-only into a
read-write mount is:
unshare -U --map-root --propagation=private -- mount -o remount,rw /some/mnt
locking such flags prevents that from happening.

> 
> > You could probably say:
> > 
> > 	A  new  mount  and user namespace pair is created.  This
> > 	happens for  example  when  specifying  CLONE_NEWUSER  |
> > 	CLONE_NEWNS  in unshare(2), clone(2), or clone3(2).
> > 	The aforementioned flags become locked in the new mount
> > 	namespace to prevent sensitive mount properties from being
> > 	altered.
> > 	Since the newly created mount namespace will be owned by the
> > 	newly created user namespace a caller privileged in the newly
> > 	created user namespace would be able to alter senstive
> > 	mount properties. For example, without locking the read-only
> > 	property for the mounts in the new mount namespace such a caller
> > 	would be able to remount them read-write.
> 
> So, I've now made the text:
> 
>        EPERM  One of the mounts had at least one of MOUNT_ATTR_NOATIME,
>               MOUNT_ATTR_NODEV, MOUNT_ATTR_NODIRATIME, MOUNT_ATTR_NOEXEC,
>               MOUNT_ATTR_NOSUID, or MOUNT_ATTR_RDONLY set and the flag is
>               locked.  Mount attributes become locked on a mount if:
> 
>               •  A new mount or mount tree is created causing mount
>                  propagation across user namespaces (i.e., propagation to
>                  a mount namespace owned by a different user namespace).
>                  The kernel will lock the aforementioned flags to prevent
>                  these sensitive properties from being altered.
> 
>               •  A new mount and user namespace pair is created.  This
>                  happens for example when specifying CLONE_NEWUSER |
>                  CLONE_NEWNS in unshare(2), clone(2), or clone3(2).  The
>                  aforementioned flags become locked in the new mount
>                  namespace to prevent sensitive mount properties from
>                  being altered.  Since the newly created mount namespace
>                  will be owned by the newly created user namespace, a
>                  calling process that is privileged in the new user
>                  namespace would—in the absence of such locking—be able
>                  to alter senstive mount properties (e.g., to remount a
>                  mount that was marked read-only as read-write in the new
>                  mount namespace).
> 
> Okay?

Sounds good.

> 
> > (Fwiw, in this scenario there's a bit of (moderately sane) strangeness.
> >  A CLONE_NEWUSER | CLONE_NEWMNT will cause even stronger protection to
> >  kick in. For all mounts not marked as expired MNT_LOCKED will be set
> >  which means that a umount() on any such mount copied from the previous
> >  mount namespace will yield EINVAL implying from userspace' perspective
> >  it's not mounted - granted EINVAL is the ioctl() of multiplexing errnos
> >  - whereas a remount to alter a locked flag will yield EPERM.)
> 
> Thanks for educating me! So, is that what we are seeing below?
> 
> $ sudo umount /mnt/m1
> $ sudo mount -t tmpfs none /mnt/m1
> $ sudo unshare -pf -Ur -m --mount-proc strace -o /tmp/log umount /mnt/m1
> umount: /mnt/m1: not mounted.
> $ grep ^umount /tmp/log
> umount2("/mnt/m1", 0)                   = -1 EINVAL (Invalid argument)
> 
> The mount_namespaces(7) page has for a log time had this text:
> 
>        *  Mounts that come as a single unit from a more privileged mount
>           namespace are locked together and may not be separated in a
>           less privileged mount namespace.  (The unshare(2) CLONE_NEWNS
>           operation brings across all of the mounts from the original
>           mount namespace as a single unit, and recursive mounts that
>           propagate between mount namespaces propagate as a single unit.)
> 
> I have had trouble understanding that. But maybe you just helped.
> Is that text relevant to what you just wrote above? In particular,
> I have trouble understanding what "separated" means. But, perhaps

The text gives the "how" not the "why".
Consider a more elaborate mount tree where e.g., you have bind-mounted a
mount over a subdirectory of another mount:

sudo mount -t tmpfs /mnt
sudo mkdir /mnt/my-dir/
sudo touch /mnt/my-dir/my-file
sudo mount --bind /opt /mnt/my-dir

The files underneath /mnt/my-dir are now hidden. Consider what would
happen if one would allow to address those mounts separately. A user
could then do:

unshare -U --map-root --mount
umount /mnt/my-dir
cat /mnt/my-dir/my-file

giving them access to what's in my-dir.

Treating such mount trees as a unit in less privileged mount namespaces
(cf. [1]) prevents that, i.e., prevents revealing files and directories
that were overmounted.

Treating such mounts as a unit is also relevant when e.g. bind-mounting
a mount tree containing locked mounts. Sticking with the example above:

unshare -U --map-root --mount

# non-recursive bind-mount will fail
mount --bind /mnt /tmp

# recursive bind-mount will succeed
mount --rbind /mnt /tmp

The reason is again that the mount tree at /mnt is treated as a mount
unit because it is locked. If one were to allow to non-recursively
bind-mountng /mnt somewhere it would mean revealing what's underneath
the mount at my-dir (This is in some sense the inverse of preventing a
filesystem from being mounted that isn't fully visible, i.e. contains
hidden or over-mounted mounts.).

These semantics, in addition to being security relevant, also allow a
more privileged mount namespace to create a restricted view of the
filesystem hierarchy that can't be circumvented in a less privileged
mount namespace (Otherwise pivot_root would have to be used which can
also be used to guarantee a restriced view on the filesystem hierarchy
especially when combined with a separate rootfs.).

Christian

[1]: I'll avoid jumping through the hoops of speaking about ownership
     all the time now for the sake of brevity. Otherwise I'll still sit
     here at lunchtime.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Questions re the new mount_setattr(2) manual page
  2021-08-12  5:36       ` Michael Kerrisk (man-pages)
@ 2021-08-12  9:08         ` Christian Brauner
  2021-08-12 22:32           ` Michael Kerrisk (man-pages)
  0 siblings, 1 reply; 15+ messages in thread
From: Christian Brauner @ 2021-08-12  9:08 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Alejandro Colomar, linux-fsdevel, lkml, linux-man, Christoph Hellwig

On Thu, Aug 12, 2021 at 07:36:24AM +0200, Michael Kerrisk (man-pages) wrote:
> Hello Christian,
> 
> Thanks for the answers.
> 
> A couple of small queries still below.
> 
> On 8/11/21 12:07 PM, Christian Brauner wrote:
> > On Tue, Aug 10, 2021 at 11:06:52PM +0200, Michael Kerrisk (man-pages) wrote:
> 
> [...]
> 
> >>>>>       EINVAL The mount that is to be ID mapped is not a
> >>>>>              detached/anonymous mount; that is, the mount is
> >>>>
> >>>> ???
> >>>> What is a the distinction between "detached" and "anonymous"?
> >>>> Or do you mean them to be synonymous? If so, then let's use
> >>>> just one term, and I think "detached" is preferable.
> >>>
> >>> Yes, they are synonymous here. I list both because detached can
> >>> potentially be confusing. A detached mount is a mount that has not been
> >>> visible in the filesystem. But if you attached it an then unmount it
> >>> right after and keep the fd for the mountpoint open it's a detached
> >>> mount purely on a natural language level, I'd argue. But it's not a
> >>> detached mount from the kernel's view anymore because it has been
> >>> exposed in the filesystem and is thus not detached anymore.
> >>> But I do prefer "detached" to "anonymous" and that confusion is very
> >>> unlikely to occur.
> >>
> >> Thanks. I made it "detached". Elsewhere, the page already explains
> >> that a detached mount is one that:
> >>
> >>           must have been created by calling open_tree(2) with the
> >>           OPEN_TREE_CLONE flag and it must not already have been
> >>           visible in the filesystem.
> >>
> >> Which seems a fine explanation. 
> >>
> >> ????
> >> But, just a thought... "visible in the filesystem" seems not quite accurate. 
> >> What you really mean I guess is that it must not already have been
> >> /visible in the filesystem hierarchy/previously mounted/something else/,
> >> right?
> 
> I suppose that I should have clarified that my main problem was
> that you were using the word "filesystem" in a way that I find
> unconventional/ambiguous. I mean, I normally take the term
> "filesystem" to be "a storage system for folding files".
> Here, you are using "filesystem" to mean something else, what 
> I might call like "the single directory hierarchy" or "the
> filesystem hierarchy" or "the list of mount points".
> 
> > A detached mount is created via the OPEN_TREE_CLONE flag. It is a
> > separate new mount so "previously mounted" is not applicable.
> > A detached mount is _related_ to what the MS_BIND flag gives you with
> > mount(2). However, they differ conceptually and technically. A MS_BIND
> > mount(2) is always visible in the fileystem when mount(2) returns, i.e.
> > it is discoverable by regular path-lookup starting within the
> > filesystem.
> > 
> > However, a detached mount can be seen as a split of MS_BIND into two
> > distinct steps:
> > 1. fd_tree = open_tree(OPEN_TREE_CLONE): create a new mount
> > 2. move_mount(fd_tree, <somewhere>):     attach the mount to the filesystem
> > 
> > 1. and 2. together give you the equivalent of MS_BIND.
> > In between 1. and 2. however the mount is detached. For the kernel
> > "detached" means that an anonymous mount namespace is attached to it
> > which doen't appear in proc and has a 0 sequence number (Technically,
> > there's a bit of semantical argument to be made that "attached" and
> > "detached" are ambiguous as they could also be taken to mean "does or
> > does not have a parent mount". This ambiguity e.g. appears in
> > do_move_mount(). That's why the kernel itself calls it an "anonymous
> > mount". However, an OPEN_TREE_CLONE-detached mount of course doesn't
> > have a parent mount so it works.).
> > 
> > For userspace it's better to think of detached and attached in terms of
> > visibility in the filesystem or in a mount namespace. That's more
> > straightfoward, more relevant, and hits the target in 90% of the cases.
> > 
> > However, the better and clearer picture is to say that a
> > OPEN_TREE_CLONE-detached mount is a mount that has never been
> > move_mount()ed. Which in turn can be defined as the detached mount has
> > never been made visible in a mount namespace. Once that has happened the
> > mount is irreversibly an attached mount.
> > 
> > I keep thinking that maybe we should just say "anonymous mount"
> > everywhere. So changing the wording to:
> 
> I'm not against the word "detached". To user space, I think it is a
> little more meaningful than "anonymous". For the moment, I'll stay with
> "detached", but if you insist on "anonymous", I'll probably change it.

No, sounds good.

> 
> > [...]
> > EINVAL The mount that is to be ID mapped is not an anonymous mount;
> > that is, the mount has already been visible in a mount namespace.
> 
> I like that text *a lot* better! Thanks very much for suggesting
> wordings. It makes my life much easier. 
> 
> I've made the text:
> 
>        EINVAL The mount that is to be ID mapped is not a detached
>               mount; that is, the mount has not previously been
>               visible in a mount namespace.

Sounds good.

> 
> > [...]
> > The mount must be an anonymous mount; that is, it must have been
> > created by calling open_tree(2) with the OPEN_TREE_CLONE flag and it
> > must not already have been visible in a mount namespace, i.e. it must
> > not have been attached to the filesystem hierarchy with syscalls such
> > as move_mount() syscall.
> 
> And that too! I've made the text:
> 
>        •  The mount must be a detached mount; that is, it must have
>           been created by calling open_tree(2) with the
>           OPEN_TREE_CLONE flag and it must not already have been
>           visible in a mount namespace.  (To put things another way:
>           the mount must not have been attached to the filesystem
>           hierarchy with a system call such as move_mount(2).)

Sounds good.

> 
> > [...]
> > 
> > (I'm using the formulation "with syscalls such as move_mount()" to
> > future proof this. :)).
> 
> Fair enough.
> 
> >>>>>   EXAMPLES
> >>>>
> >>>> ???
> >>>> Do you have a (preferably simple) example piece of code
> >>>> somewhere for setting up an ID mapped mount?
> >>
> >> ????
> >> I guess the best example is this:
> >> https://github.com/brauner/mount-idmapped/
> >> right?
> > 
> > Ah yes, sorry. I forgot to answer that yesterday. I sent you links via
> > another medium but I repeat it here.
> > There are two places. The link you have here is a private repo. But I've
> > also merged a program alongside the fstests testsuite I merged:
> > https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/tree/src/idmapped-mounts/mount-idmapped.c
> > which should be nicer and has seen reviews by Amir and Christoph.
> 
> Thanks.
> 
> [...]
> 
> >>>>>           int fd_tree = open_tree(-EBADF, source,
> >>>>>                        OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC |
> >>>>>                        AT_EMPTY_PATH | (recursive ? AT_RECURSIVE : 0));
> >>>>
> >>>> ???
> >>>> What is the significance of -EBADF here? As far as I can tell, it
> >>>> is not meaningful to open_tree()?
> >>>
> >>> I always pass -EBADF for similar reasons to [2]. Feel free to just use -1.
> >>
> >> ????
> >> But here, both -EBADF and -1 seem to be wrong. This argument 
> >> is a dirfd, and so should either be a file descriptor or the
> >> value AT_FDCWD, right?
> > 
> > [1]: In this code "source" is expected to be absolute. If it's not
> >      absolute we should fail. This can be achieved by passing -1/-EBADF,
> >      afaict.
> 
> D'oh! Okay. I hadn't considered that use case for an invalid dirfd.
> (And now I've done some adjustments to openat(2),which contains a
> rationale for the *at() functions.)
> 
> So, now I understand your purpose, but still the code is obscure,
> since
> 
> * You use a magic value (-EBADF) rather than (say) -1.
> * There's no explanation (comment about) of the fact that you want
>   to prevent relative pathnames.
> 
> So, I've changed the code to use -1, not -EBADF, and I've added some
> comments to explain that the intent is to prevent relative pathnames.
> Okay?

Sounds good.

> 
> But, there is still the meta question: what's the problem with using
> a relative pathname?

Nothing per se. Ok, you asked so it's your fault:
When writing programs I like to never use relative paths with AT_FDCWD
because. Because making assumptions about the current working directory
of the calling process is just too easy to get wrong; especially when
pivot_root() or chroot() are in play.
My absolut preference (joke intended) is to open a well-known starting
point with an absolute path to get a dirfd and then scope all future
operations beneath that dirfd. This already works with old-style
openat() and _very_ cautious programming but openat2() and its
resolve-flag space have made this **chef's kiss**.
If I can't operate based on a well-known dirfd I use absolute paths with
a -EBADF dirfd passed to *at() functions.

Christian

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Questions re the new mount_setattr(2) manual page
  2021-08-12  9:08         ` Christian Brauner
@ 2021-08-12 22:32           ` Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 15+ messages in thread
From: Michael Kerrisk (man-pages) @ 2021-08-12 22:32 UTC (permalink / raw)
  To: Christian Brauner
  Cc: mtk.manpages, Alejandro Colomar, linux-fsdevel, lkml, linux-man,
	Christoph Hellwig

Hi Christian,

[...]

Thanks for checking the various wordinfs.

[...]

>>>>>>>           int fd_tree = open_tree(-EBADF, source,
>>>>>>>                        OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC |
>>>>>>>                        AT_EMPTY_PATH | (recursive ? AT_RECURSIVE : 0));
>>>>>>
>>>>>> ???
>>>>>> What is the significance of -EBADF here? As far as I can tell, it
>>>>>> is not meaningful to open_tree()?
>>>>>
>>>>> I always pass -EBADF for similar reasons to [2]. Feel free to just use -1.
>>>>
>>>> ????
>>>> But here, both -EBADF and -1 seem to be wrong. This argument 
>>>> is a dirfd, and so should either be a file descriptor or the
>>>> value AT_FDCWD, right?
>>>
>>> [1]: In this code "source" is expected to be absolute. If it's not
>>>      absolute we should fail. This can be achieved by passing -1/-EBADF,
>>>      afaict.
>>
>> D'oh! Okay. I hadn't considered that use case for an invalid dirfd.
>> (And now I've done some adjustments to openat(2),which contains a
>> rationale for the *at() functions.)
>>
>> So, now I understand your purpose, but still the code is obscure,
>> since
>>
>> * You use a magic value (-EBADF) rather than (say) -1.
>> * There's no explanation (comment about) of the fact that you want
>>   to prevent relative pathnames.
>>
>> So, I've changed the code to use -1, not -EBADF, and I've added some
>> comments to explain that the intent is to prevent relative pathnames.
>> Okay?
> 
> Sounds good.
> 
>>
>> But, there is still the meta question: what's the problem with using
>> a relative pathname?
> 
> Nothing per se. Ok, you asked so it's your fault:
> When writing programs I like to never use relative paths with AT_FDCWD
> because. Because making assumptions about the current working directory
> of the calling process is just too easy to get wrong; especially when
> pivot_root() or chroot() are in play.
> My absolut preference (joke intended) is to open a well-known starting
> point with an absolute path to get a dirfd and then scope all future
> operations beneath that dirfd. This already works with old-style
> openat() and _very_ cautious programming but openat2() and its
> resolve-flag space have made this **chef's kiss**.
> If I can't operate based on a well-known dirfd I use absolute paths with
> a -EBADF dirfd passed to *at() functions.

Thanks for the clarification. I've noted your rationale in a 
comment in the manual page source so that future maintainers 
will not be puzzled!

Cheers,

Michael



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Questions re the new mount_setattr(2) manual page
  2021-08-12  8:38       ` Christian Brauner
@ 2021-08-13  1:25         ` Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 15+ messages in thread
From: Michael Kerrisk (man-pages) @ 2021-08-13  1:25 UTC (permalink / raw)
  To: Christian Brauner
  Cc: mtk.manpages, Alejandro Colomar, linux-fsdevel, lkml, linux-man,
	Christoph Hellwig, Eric W. Biederman

Hello Christian,

On 8/12/21 10:38 AM, Christian Brauner wrote:
> On Thu, Aug 12, 2021 at 07:36:54AM +0200, Michael Kerrisk (man-pages) wrote:
>> [CC += Eric, in case he has a comment on the last piece]

[...]

>>> That's really splitting hairs.
>>
>> To be clear, I'm not trying to split hairs :-). It's just that
>> I'm struggling a little to understand. (In particular, the notion
>> of locked mounts is one where my understanding is weak.) 
>>
>> And think of it like this: I am the first line of defense for the
>> user-space reader. If I am having trouble to understand the text,
>> I wont be alone. And often, the problem is not so much that the
>> text is "wrong", it's that there's a difference in background
>> knowledge between what you know and what the reader (in this case
>> me) knows. Part of my task is to fill that gap, by adding info
>> that I think is necessary to the page (with the happy side
>> effect that I learn along the way.)
> 
> All very good points.
> I didn't mean to complain btw. Sorry that it seemed that way. :)

No problem. I need to think more carefully about my words 
sometimes in mails too :-)

>>> Of course this means that we're
>>> propagating into a mount namespace that is owned by a different user
>>> namespace though "crossing user namespaces" might have been the better
>>> choice.
>>
>> This is a perfect example of the point I make above. You say "of course",
>> but I don't have the background knowledge that you do :-). From my
>> perspective, I want to make sure that I understand your meaning, so
>> that that meaning can (IMHO) be made easier for the average reader
>> of the manual page.
>>
>>>>                  the aforementioned  flags  to  protect  these  sensitive
>>>>                  properties from being altered.
>>>>
>>>>               •  A  new  mount  and user namespace pair is created.  This
>>>>                  happens for  example  when  specifying  CLONE_NEWUSER  |
>>>>                  CLONE_NEWNS  in unshare(2), clone(2), or clone3(2).  The
>>>>                  aforementioned flags become locked to protect user name‐
>>>>                  spaces from altering sensitive mount properties.
>>>>
>>>> Again, this seems imprecise. Should it say something like:
>>>> "... to prevent changes to sensitive mount properties in the new 
>>>> mount namespace" ? Or perhaps you have a better wording.
>>>
>>> That's not imprecise. 
>>
>> Okay -- poor choice of wording on my part:
>>
>> s/this seems imprecise/I'm having trouble understanding this/
>>
>>> What you want to protect against is altering
>>> sensitive mount properties from within a user namespace irrespective of
>>> whether or not the user namespace actually owns the mount namespace,
>>> i.e. even if you own the mount namespace you shouldn't be able to alter
>>> those properties. I concede though that "protect" should've been
>>> "prevent".
>>
>> Can I check my education here please. The point is this:
>>
>> * The mount point was created in a mount NS that was owned by
>>   a more privileged user NS (e.g., the initial user NS).
>> * A CLONE_NEWUSER|CLONE_NEWNS step occurs to create a new (user and) 
>>   mount NS.
>> * In the new mount NS, the mounts become locked.
>>
>> And, help me here: is it correct that the reason the properties
>> need to be locked is because they are shared between the mounts?
> 
> Yes, basically.

Yes, but that last sentence of mine was wrong, wasn't it? The 
properties are not actually shared between the mounts, right?
(Earlier, I had done in experiment which misled e into thinking
there was sharing, but now it looks to me like there is not.)

> The new mount namespace contains a copy of all the mounts in the
> previous mount namespace. So they are separate mounts which you can best
> see when you do unshare --mount --propagation=private. An unmount in the
> new mount namespace won't affect the mount in the previous mount
> namespace. Which can only nicely work if they are separate mounts.
> Propagation relies (among other things) on the fact that mount
> namespaces have copies of the mounts.
> 
> The copied mounts in the new mount namespace will have inherited all
> properties they had at the time when copy_namespaces() and specifically
> copy_mnt_ns() was called. Which calls into copy_tree() and ultimately
> into the appropriately named clone_mnt(). This is the low-level routine
> that is responsible for cloning the mounts including their mount
> properties.
> 
> Some mount properties such as read-only, nodev, noexec, nosuid, atime -
> while arguably not per se security mechanisms - are used for protection
> or as security measures in userspace applications. The most obvious one
> might be the read-only property. One wouldn't want to expose a set of
> files as read-only only for someone else to trivially gain write access
> to them. An example of where that could happen is when creating a new
> mount namespaces and user namespace pair where the new mount namespace
> is owned by the new user namespace in which the caller is privileged and
> thus the caller would also able to alter the new mount namespace. So
> without locking flags all it would take to turn a read-only into a
> read-write mount is:
> unshare -U --map-root --propagation=private -- mount -o remount,rw /some/mnt
> locking such flags prevents that from happening.

Thanks for the detailed explanation; it's very helpful.

>>> You could probably say:
>>>
>>> 	A  new  mount  and user namespace pair is created.  This
>>> 	happens for  example  when  specifying  CLONE_NEWUSER  |
>>> 	CLONE_NEWNS  in unshare(2), clone(2), or clone3(2).
>>> 	The aforementioned flags become locked in the new mount
>>> 	namespace to prevent sensitive mount properties from being
>>> 	altered.
>>> 	Since the newly created mount namespace will be owned by the
>>> 	newly created user namespace a caller privileged in the newly
>>> 	created user namespace would be able to alter senstive
>>> 	mount properties. For example, without locking the read-only
>>> 	property for the mounts in the new mount namespace such a caller
>>> 	would be able to remount them read-write.
>>
>> So, I've now made the text:
>>
>>        EPERM  One of the mounts had at least one of MOUNT_ATTR_NOATIME,
>>               MOUNT_ATTR_NODEV, MOUNT_ATTR_NODIRATIME, MOUNT_ATTR_NOEXEC,
>>               MOUNT_ATTR_NOSUID, or MOUNT_ATTR_RDONLY set and the flag is
>>               locked.  Mount attributes become locked on a mount if:
>>
>>               •  A new mount or mount tree is created causing mount
>>                  propagation across user namespaces (i.e., propagation to
>>                  a mount namespace owned by a different user namespace).
>>                  The kernel will lock the aforementioned flags to prevent
>>                  these sensitive properties from being altered.
>>
>>               •  A new mount and user namespace pair is created.  This
>>                  happens for example when specifying CLONE_NEWUSER |
>>                  CLONE_NEWNS in unshare(2), clone(2), or clone3(2).  The
>>                  aforementioned flags become locked in the new mount
>>                  namespace to prevent sensitive mount properties from
>>                  being altered.  Since the newly created mount namespace
>>                  will be owned by the newly created user namespace, a
>>                  calling process that is privileged in the new user
>>                  namespace would—in the absence of such locking—be able
>>                  to alter senstive mount properties (e.g., to remount a
>>                  mount that was marked read-only as read-write in the new
>>                  mount namespace).
>>
>> Okay?
> 
> Sounds good.

Okay.

>>> (Fwiw, in this scenario there's a bit of (moderately sane) strangeness.
>>>  A CLONE_NEWUSER | CLONE_NEWMNT will cause even stronger protection to
>>>  kick in. For all mounts not marked as expired MNT_LOCKED will be set
>>>  which means that a umount() on any such mount copied from the previous
>>>  mount namespace will yield EINVAL implying from userspace' perspective
>>>  it's not mounted - granted EINVAL is the ioctl() of multiplexing errnos
>>>  - whereas a remount to alter a locked flag will yield EPERM.)
>>
>> Thanks for educating me! So, is that what we are seeing below?

(Was your silence to the above question an implicit "yes"?)

>> $ sudo umount /mnt/m1
>> $ sudo mount -t tmpfs none /mnt/m1
>> $ sudo unshare -pf -Ur -m --mount-proc strace -o /tmp/log umount /mnt/m1
>> umount: /mnt/m1: not mounted.
>> $ grep ^umount /tmp/log
>> umount2("/mnt/m1", 0)                   = -1 EINVAL (Invalid argument)
>>
>> The mount_namespaces(7) page has for a log time had this text:
>>
>>        *  Mounts that come as a single unit from a more privileged mount
>>           namespace are locked together and may not be separated in a
>>           less privileged mount namespace.  (The unshare(2) CLONE_NEWNS
>>           operation brings across all of the mounts from the original
>>           mount namespace as a single unit, and recursive mounts that
>>           propagate between mount namespaces propagate as a single unit.)
>>
>> I have had trouble understanding that. But maybe you just helped.
>> Is that text relevant to what you just wrote above? In particular,
>> I have trouble understanding what "separated" means. But, perhaps
> 
> The text gives the "how" not the "why".

Yes, that's a big problem :-}.

> Consider a more elaborate mount tree where e.g., you have bind-mounted a
> mount over a subdirectory of another mount:
> 
> sudo mount -t tmpfs /mnt
> sudo mkdir /mnt/my-dir/
> sudo touch /mnt/my-dir/my-file
> sudo mount --bind /opt /mnt/my-dir
> 
> The files underneath /mnt/my-dir are now hidden. Consider what would
> happen if one would allow to address those mounts separately. A user
> could then do:
> 
> unshare -U --map-root --mount
> umount /mnt/my-dir
> cat /mnt/my-dir/my-file
> 
> giving them access to what's in my-dir.
> 
> Treating such mount trees as a unit in less privileged mount namespaces
> (cf. [1]) prevents that, i.e., prevents revealing files and directories
> that were overmounted.

Got it!
 
> Treating such mounts as a unit is also relevant when e.g. bind-mounting
> a mount tree containing locked mounts. Sticking with the example above:
> 
> unshare -U --map-root --mount
> 
> # non-recursive bind-mount will fail
> mount --bind /mnt /tmp
> 
> # recursive bind-mount will succeed
> mount --rbind /mnt /tmp
> 
> The reason is again that the mount tree at /mnt is treated as a mount
> unit because it is locked. If one were to allow to non-recursively
> bind-mountng /mnt somewhere it would mean revealing what's underneath
> the mount at my-dir (This is in some sense the inverse of preventing a
> filesystem from being mounted that isn't fully visible, i.e. contains
> hidden or over-mounted mounts.).

Got it!

> These semantics, in addition to being security relevant, also allow a
> more privileged mount namespace to create a restricted view of the
> filesystem hierarchy that can't be circumvented in a less privileged
> mount namespace (Otherwise pivot_root would have to be used which can
> also be used to guarantee a restriced view on the filesystem hierarchy
> especially when combined with a separate rootfs.).

Okay.

Christian, thanks for so generously taking the time to write this up.
It really helped me a lot! I will do some work on the mount namespaces
manual page, to cover at least part of what you said.

Thanks,

Michael

> Christian
> 
> [1]: I'll avoid jumping through the hoops of speaking about ownership
>      all the time now for the sake of brevity. Otherwise I'll still sit
>      here at lunchtime.
> 


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2021-08-13  1:25 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-10  1:38 Questions re the new mount_setattr(2) manual page Michael Kerrisk (man-pages)
2021-08-10  7:12 ` Michael Kerrisk (man-pages)
2021-08-10 14:11   ` Christian Brauner
2021-08-10 19:30     ` Michael Kerrisk (man-pages)
2021-08-10 14:32 ` Christian Brauner
2021-08-10 21:06   ` Michael Kerrisk (man-pages)
2021-08-11 10:07     ` Christian Brauner
2021-08-12  5:36       ` Michael Kerrisk (man-pages)
2021-08-12  9:08         ` Christian Brauner
2021-08-12 22:32           ` Michael Kerrisk (man-pages)
2021-08-10 22:47 ` Michael Kerrisk (man-pages)
2021-08-11 10:40   ` Christian Brauner
2021-08-12  5:36     ` Michael Kerrisk (man-pages)
2021-08-12  8:38       ` Christian Brauner
2021-08-13  1:25         ` Michael Kerrisk (man-pages)

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.