linux-man.vger.kernel.org archive mirror
* [PATCH RFC 0/3] document openat2(2) patch series
@ 2019-10-03 14:55 Aleksa Sarai
  2019-10-03 14:55 ` [PATCH RFC 1/3] symlink.7: document magic-links more completely Aleksa Sarai
                   ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: Aleksa Sarai @ 2019-10-03 14:55 UTC (permalink / raw)
  To: Al Viro, Michael Kerrisk
  Cc: Aleksa Sarai, Christian Brauner, Aleksa Sarai, linux-man,
	linux-api, linux-kernel

This is a first draft of the man-page changes for the openat2(2) patch
series I'm working on[1]. It includes information about the magic-link
changes as well as the primary new features (O_EMPTYPATH and openat2).

Let me know what you think. I may have gone into too much detail about how
extension of openat2(2) will work; tell me if that section should be
dropped. While it is useful for userspace to understand, it isn't really
necessary to explain the exact semantics, since it will usually just work
transparently.
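
As a rough illustration of the size handling described in the
"Extensibility" notes of patch 3/3, the three usize/ksize cases can be
sketched as below. This is only a sketch under stated assumptions:
copy_versioned_struct() is a hypothetical helper invented for this
example, not the kernel's implementation, and it models the "unsupported
extension is non-zero" case with E2BIG as the ERRORS section describes.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not kernel code): copy a size-versioned struct of
 * usize bytes from src into a ksize-byte dst, following the three cases
 * from the Extensibility notes. Returns 0 or a negative errno value.
 */
int copy_versioned_struct(void *dst, size_t ksize,
                          const void *src, size_t usize)
{
    const unsigned char *in = src;
    size_t common = usize < ksize ? usize : ksize;

    /* ksize < usize: bytes the "kernel" does not know about must be zero. */
    for (size_t i = ksize; i < usize; i++) {
        if (in[i] != 0)
            return -E2BIG;
    }

    /* ksize > usize: fields userspace did not supply are treated as zero. */
    memset(dst, 0, ksize);
    memcpy(dst, src, common);
    return 0;
}
```

When ksize equals usize, the loop body never runs and the copy is
verbatim, which matches the first case in the notes.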

[1]: https://lore.kernel.org/lkml/20190930183316.10190-1-cyphar@cyphar.com/

Aleksa Sarai (3):
  symlink.7: document magic-links more completely
  open.2: add O_EMPTYPATH documentation
  openat2.2: document new openat2(2) syscall

 man2/open.2            |  47 ++++-
 man2/openat2.2         | 381 +++++++++++++++++++++++++++++++++++++++++
 man7/path_resolution.7 |  89 ++++++++--
 man7/symlink.7         |  39 ++++-
 4 files changed, 528 insertions(+), 28 deletions(-)
 create mode 100644 man2/openat2.2

-- 
2.23.0



* [PATCH RFC 1/3] symlink.7: document magic-links more completely
  2019-10-03 14:55 [PATCH RFC 0/3] document openat2(2) patch series Aleksa Sarai
@ 2019-10-03 14:55 ` Aleksa Sarai
  2019-10-07 16:36   ` Jann Horn
                     ` (2 more replies)
  2019-10-03 14:55 ` [PATCH RFC 2/3] open.2: add O_EMPTYPATH documentation Aleksa Sarai
                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 18+ messages in thread
From: Aleksa Sarai @ 2019-10-03 14:55 UTC (permalink / raw)
  To: Al Viro, Michael Kerrisk
  Cc: Aleksa Sarai, Christian Brauner, Aleksa Sarai, linux-man,
	linux-api, linux-kernel

Traditionally, magic-links have not been a well-understood topic in
Linux. Given the new changes in their semantics (related to the link
mode of trailing magic-links), it seems like a good opportunity to shine
more light on magic-links and their semantics.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
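For reviewers, the "requested access mode must not exceed the link mode"
rule added below might be pictured with the following sketch. It is an
illustration only: link_mode_permits() is a hypothetical helper, it
assumes that only the owner read/write bits of the link mode are
consulted, and it ignores the exemption for privileged users.

```c
#include <fcntl.h>
#include <stdbool.h>
#include <sys/stat.h>

/*
 * Hypothetical helper: would an open(2) with these flags stay within the
 * magic-link's link mode? Assumption: only S_IRUSR and S_IWUSR matter.
 */
bool link_mode_permits(mode_t link_mode, int flags)
{
    int acc = flags & O_ACCMODE;
    bool want_read = (acc == O_RDONLY || acc == O_RDWR);
    bool want_write = (acc == O_WRONLY || acc == O_RDWR);

    if (want_read && !(link_mode & S_IRUSR))
        return false;
    if (want_write && !(link_mode & S_IWUSR))
        return false;
    return true;
}
```

With a link mode of 0500, as in the path_resolution.7 example,
link_mode_permits(0500, O_RDONLY) holds while O_WRONLY and O_RDWR are
refused.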
 man7/path_resolution.7 | 15 +++++++++++++++
 man7/symlink.7         | 39 ++++++++++++++++++++++++++++++---------
 2 files changed, 45 insertions(+), 9 deletions(-)

diff --git a/man7/path_resolution.7 b/man7/path_resolution.7
index 07664ed8faec..46f25ec4cdfa 100644
--- a/man7/path_resolution.7
+++ b/man7/path_resolution.7
@@ -136,6 +136,21 @@ we are just creating it.
 The details on the treatment
 of the final entry are described in the manual pages of the specific
 system calls.
+.PP
+Since Linux 5.FOO, if the final entry is a "magic-link" (see
+.BR symlink (7)),
+and the user is attempting to
+.BR open (2)
+it, then there is an additional permission-related restriction applied to the
+operation: the requested access mode must not exceed the "link mode" of the
+magic-link (unlike ordinary symlinks, magic-links have their own file mode.)
+For example, if
+.I /proc/[pid]/fd/[num]
+has a link mode of
+.BR 0500 ,
+unprivileged users are not permitted to
+.BR open ()
+the magic-link for writing.
 .SS . and ..
 By convention, every directory has the entries "." and "..",
 which refer to the directory itself and to its parent directory,
diff --git a/man7/symlink.7 b/man7/symlink.7
index 9f5bddd5dc21..33f0ec703acd 100644
--- a/man7/symlink.7
+++ b/man7/symlink.7
@@ -84,6 +84,25 @@ as they are implemented on Linux and other systems,
 are outlined here.
 It is important that site-local applications also conform to these rules,
 so that the user interface can be as consistent as possible.
+.SS Magic-links
+There is a special class of symlink-like objects known as "magic-links" which
+can be found in certain pseudo-filesystems such as
+.BR proc (5)
+(examples include
+.IR /proc/[pid]/exe " and " /proc/[pid]/fd/* .)
+Unlike normal symlinks, magic-links are not resolved through
+pathname-expansion, but instead act as direct references to the kernel's own
+representation of a file handle. As such, these magic-links allow users to
+access files which cannot be referenced with normal paths (such as unlinked
+files still referenced by a running program.)
+.PP
+Because they can bypass ordinary
+.BR mount_namespaces (7)-based
+restrictions, magic-links have been used as attack vectors in various exploits.
+As such (since Linux 5.FOO), there are additional restrictions placed on the
+re-opening of magic-links (see
+.BR path_resolution (7)
+for more details.)
 .SS Symbolic link ownership, permissions, and timestamps
 The owner and group of an existing symbolic link can be changed
 using
@@ -99,16 +118,18 @@ of a symbolic link can be changed using
 or
 .BR lutimes (3).
 .PP
-On Linux, the permissions of a symbolic link are not used
-in any operations; the permissions are always
-0777 (read, write, and execute for all user categories),
 .\" Linux does not currently implement an lchmod(2).
-and can't be changed.
-(Note that there are some "magic" symbolic links in the
-.I /proc
-directory tree\(emfor example, the
-.IR /proc/[pid]/fd/*
-files\(emthat have different permissions.)
+On Linux, the permissions of an ordinary symbolic link are not used in any
+operations; the permissions are always 0777 (read, write, and execute for all
+user categories), and can't be changed.
+.PP
+However, magic-links do not follow this rule. They can have a non-0777 mode,
+which is used for permission checks when the final
+component of the path passed to
+.BR open (2)
+is a magic-link (see
+.BR path_resolution (7).)
+
 .\"
 .\" The
 .\" 4.4BSD
-- 
2.23.0



* [PATCH RFC 2/3] open.2: add O_EMPTYPATH documentation
  2019-10-03 14:55 [PATCH RFC 0/3] document openat2(2) patch series Aleksa Sarai
  2019-10-03 14:55 ` [PATCH RFC 1/3] symlink.7: document magic-links more completely Aleksa Sarai
@ 2019-10-03 14:55 ` Aleksa Sarai
  2019-10-09  8:01   ` Michael Kerrisk (man-pages)
  2019-10-03 14:55 ` [PATCH RFC 3/3] openat2.2: document new openat2(2) syscall Aleksa Sarai
  2019-10-03 14:55 ` [PATCH RFC 3/3] openat2.2: document new syscall Aleksa Sarai
  3 siblings, 1 reply; 18+ messages in thread
From: Aleksa Sarai @ 2019-10-03 14:55 UTC (permalink / raw)
  To: Al Viro, Michael Kerrisk
  Cc: Aleksa Sarai, Christian Brauner, Aleksa Sarai, linux-man,
	linux-api, linux-kernel

Some of the wording around empty paths in path_resolution(7) also needed
to be reworked since empty paths are now legal (if you pass O_EMPTYPATH).

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
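The path_resolution.7 text below describes O_EMPTYPATH as approximately
equivalent to opening /proc/self/fd/$fd. For comparison, that existing
procfs route looks roughly like the sketch below; reopen_via_proc() is a
hypothetical helper name, and the code assumes procfs is mounted at
/proc.

```c
#include <fcntl.h>
#include <stdio.h>

/*
 * Re-open an existing descriptor with new flags through its
 * /proc/self/fd entry. Returns a new descriptor, or -1 with errno set
 * by open(2). Assumes procfs is mounted at /proc.
 */
int reopen_via_proc(int fd, int flags)
{
    char path[64];

    snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
    return open(path, flags);
}
```

The descriptor returned is a separate open file description, so its
access mode can differ from that of the original descriptor.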
 man2/open.2            | 42 +++++++++++++++++++++++++++++++++++++++++-
 man7/path_resolution.7 | 17 ++++++++++++++++-
 2 files changed, 57 insertions(+), 2 deletions(-)

diff --git a/man2/open.2 b/man2/open.2
index b0f485b41589..7217fe056e5e 100644
--- a/man2/open.2
+++ b/man2/open.2
@@ -48,7 +48,7 @@
 .\" FIXME . Apr 08: The next POSIX revision has O_EXEC, O_SEARCH, and
 .\" O_TTYINIT.  Eventually these may need to be documented.  --mtk
 .\"
-.TH OPEN 2 2018-04-30 "Linux" "Linux Programmer's Manual"
+.TH OPEN 2 2019-10-03 "Linux" "Linux Programmer's Manual"
 .SH NAME
 open, openat, creat \- open and possibly create a file
 .SH SYNOPSIS
@@ -421,6 +421,21 @@ was followed by a call to
 .BR fdatasync (2)).
 .IR "See NOTES below" .
 .TP
+.BR O_EMPTYPATH " (since Linux 5.FOO)"
+If \fIpathname\fP is an empty string, re-open the file descriptor given as
+the \fIdirfd\fP argument to
+.BR openat (2).
+This can be used with both ordinary (file and directory) and \fBO_PATH\fP file
+descriptors, but cannot be used with
+.BR AT_FDCWD
+(or as an argument to plain
+.BR open (2).) When re-opening an \fBO_PATH\fP file descriptor, the same "link
+mode" restrictions apply as with re-opening through
+.BR proc (5)
+(see
+.BR path_resolution "(7) and " symlink (7)
+for more details.)
+.TP
 .B O_EXCL
 Ensure that this call creates the file:
 if this flag is specified in conjunction with
@@ -668,6 +683,13 @@ with
 (or via procfs using
 .BR AT_SYMLINK_FOLLOW )
 even if the file is not a directory.
+You can even "re-open" (or upgrade) an
+.BR O_PATH
+file descriptor by using
+.BR O_EMPTYPATH
+(see the section for
+.BR O_EMPTYPATH
+for more details.)
 .IP *
 Passing the file descriptor to another process via a UNIX domain socket
 (see
@@ -958,6 +980,15 @@ is not allowed.
 (See also
 .BR path_resolution (7).)
 .TP
+.B EBADF
+.I pathname
+was an empty string (and
+.B O_EMPTYPATH
+was passed) with
+.BR open (2)
+(instead of
+.BR openat (2).)
+.TP
 .B EDQUOT
 Where
 .B O_CREAT
@@ -1203,6 +1234,15 @@ The following additional errors can occur for
 .I dirfd
 is not a valid file descriptor.
 .TP
+.B EBADF
+.I pathname
+was an empty string (and
+.B O_EMPTYPATH
+was passed), but the provided
+.I dirfd
+was an invalid file descriptor (or was
+.BR AT_FDCWD .)
+.TP
 .B ENOTDIR
 .I pathname
 is a relative pathname and
diff --git a/man7/path_resolution.7 b/man7/path_resolution.7
index 46f25ec4cdfa..85dd354e9a93 100644
--- a/man7/path_resolution.7
+++ b/man7/path_resolution.7
@@ -22,7 +22,7 @@
 .\" the source, must acknowledge the copyright and authors of this work.
 .\" %%%LICENSE_END
 .\"
-.TH PATH_RESOLUTION 7 2017-11-26 "Linux" "Linux Programmer's Manual"
+.TH PATH_RESOLUTION 7 2019-10-03 "Linux" "Linux Programmer's Manual"
 .SH NAME
 path_resolution \- how a pathname is resolved to a file
 .SH DESCRIPTION
@@ -198,6 +198,21 @@ successfully.
 Linux returns
 .B ENOENT
 in this case.
+.PP
+As of Linux 5.FOO, an empty path argument can be used to "re-open"
+an existing file descriptor if
+.B O_EMPTYPATH
+is passed as a flag argument to
+.BR openat (2),
+with the
+.I dfd
+argument indicating which file descriptor to "re-open". This is approximately
+equivalent to opening
+.I /proc/self/fd/$fd
+where
+.I $fd
+is the open file descriptor to be "re-opened".
+
 .SS Permissions
 The permission bits of a file consist of three groups of three bits; see
 .BR chmod (1)
-- 
2.23.0



* [PATCH RFC 3/3] openat2.2: document new openat2(2) syscall
  2019-10-03 14:55 [PATCH RFC 0/3] document openat2(2) patch series Aleksa Sarai
  2019-10-03 14:55 ` [PATCH RFC 1/3] symlink.7: document magic-links more completely Aleksa Sarai
  2019-10-03 14:55 ` [PATCH RFC 2/3] open.2: add O_EMPTYPATH documentation Aleksa Sarai
@ 2019-10-03 14:55 ` Aleksa Sarai
  2019-10-09  8:36   ` Michael Kerrisk (man-pages)
  2019-10-03 14:55 ` [PATCH RFC 3/3] openat2.2: document new syscall Aleksa Sarai
  3 siblings, 1 reply; 18+ messages in thread
From: Aleksa Sarai @ 2019-10-03 14:55 UTC (permalink / raw)
  To: Al Viro, Michael Kerrisk
  Cc: Aleksa Sarai, Christian Brauner, Aleksa Sarai, linux-man,
	linux-api, linux-kernel

Rather than trying to merge the new syscall documentation into open.2
(which would probably make that man-page incomprehensible), the new
syscall gets its own dedicated page, with links between open(2) and
openat2(2) to avoid duplicating information such as the list of O_*
flags or common errors.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
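To make the upgrade_mask description in openat2.2 easier to review, here
is a minimal sketch of the check it implies. Everything here is
illustrative: upgrade_permitted() is a hypothetical helper, and the
numeric values given to UPGRADE_NOREAD and UPGRADE_NOWRITE are
placeholders, since this patch does not state them.

```c
#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder values (assumption): the patch does not give the constants. */
#define UPGRADE_NOWRITE 0x02
#define UPGRADE_NOREAD  0x04

/*
 * Hypothetical helper: may an O_PATH descriptor opened with this
 * upgrade_mask be re-opened with the given open(2) flags?
 */
bool upgrade_permitted(uint16_t upgrade_mask, int flags)
{
    int acc = flags & O_ACCMODE;
    bool reads = (acc == O_RDONLY || acc == O_RDWR);
    /* The page counts O_APPEND as a request for write access. */
    bool writes = (acc == O_WRONLY || acc == O_RDWR || (flags & O_APPEND));

    if (reads && (upgrade_mask & UPGRADE_NOREAD))
        return false;
    if (writes && (upgrade_mask & UPGRADE_NOWRITE))
        return false;
    return true;
}
```

A zero upgrade_mask leaves every access mode available, matching the
default described for ordinary files.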
 man2/open.2            |   5 +
 man2/openat2.2         | 381 +++++++++++++++++++++++++++++++++++++++++
 man7/path_resolution.7 |  57 ++++--
 3 files changed, 426 insertions(+), 17 deletions(-)
 create mode 100644 man2/openat2.2

diff --git a/man2/open.2 b/man2/open.2
index 7217fe056e5e..a0b43394bbee 100644
--- a/man2/open.2
+++ b/man2/open.2
@@ -65,6 +65,10 @@ open, openat, creat \- open and possibly create a file
 .BI "int openat(int " dirfd ", const char *" pathname ", int " flags );
 .BI "int openat(int " dirfd ", const char *" pathname ", int " flags \
 ", mode_t " mode );
+.PP
+/* Documented separately, in \fBopenat2\fP(2). */
+.BI "int openat2(int " dirfd ", const char *" pathname ", \
+const struct open_how *" how ", size_t " size ");
 .fi
 .PP
 .in -4n
@@ -1808,6 +1812,7 @@ will create a regular file (i.e.,
 .B O_DIRECTORY
 is ignored).
 .SH SEE ALSO
+.BR openat2 (2),
 .BR chmod (2),
 .BR chown (2),
 .BR close (2),
diff --git a/man2/openat2.2 b/man2/openat2.2
new file mode 100644
index 000000000000..c43c76046243
--- /dev/null
+++ b/man2/openat2.2
@@ -0,0 +1,381 @@
+.\" Copyright (C) 2019 Aleksa Sarai <cyphar@cyphar.com>
+.\"
+.\" %%%LICENSE_START(VERBATIM)
+.\" Permission is granted to make and distribute verbatim copies of this
+.\" manual provided the copyright notice and this permission notice are
+.\" preserved on all copies.
+.\"
+.\" Permission is granted to copy and distribute modified versions of this
+.\" manual under the conditions for verbatim copying, provided that the
+.\" entire resulting derived work is distributed under the terms of a
+.\" permission notice identical to this one.
+.\"
+.\" Since the Linux kernel and libraries are constantly changing, this
+.\" manual page may be incorrect or out-of-date.  The author(s) assume no
+.\" responsibility for errors or omissions, or for damages resulting from
+.\" the use of the information contained herein.  The author(s) may not
+.\" have taken the same level of care in the production of this manual,
+.\" which is licensed free of charge, as they might when working
+.\" professionally.
+.\"
+.\" Formatted or processed versions of this manual, if unaccompanied by
+.\" the source, must acknowledge the copyright and authors of this work.
+.\" %%%LICENSE_END
+.TH OPENAT2 2 2019-10-03 "Linux" "Linux Programmer's Manual"
+.SH NAME
+openat2 \- open and possibly create a file (extended)
+.SH SYNOPSIS
+.nf
+.B #include <sys/types.h>
+.B #include <sys/stat.h>
+.B #include <fcntl.h>
+.PP
+.BI "int openat2(int " dirfd ", const char *" pathname ", \
+const struct open_how *" how ", size_t " size ");
+.fi
+.PP
+.IR Note :
+There is no glibc wrapper for this system call; see NOTES.
+.SH DESCRIPTION
+The
+.BR openat2 ()
+system call is an extension of
+.BR openat (2)
+and provides a superset of its functionality. Rather than taking a single
+.I flags
+argument, an extensible structure (\fIhow\fP) is passed instead to allow for
+seamless future extensions.
+.PP
+.I size
+must be set to
+.IR "sizeof(struct open_how)" ,
+to facilitate future extensions (see the "Extensibility" section of the
+\fBNOTES\fP for more detail on how extensions are handled.)
+
+.SS The open_how structure
+The following structure indicates how
+.I pathname
+should be opened, and acts as a superset of the
+.IR flags " and " mode
+arguments to
+.BR openat (2).
+.PP
+.in +4n
+.EX
+struct open_how {
+    uint32_t flags;              /* open(2)-style O_* flags. */
+    union {
+        uint16_t mode;           /* File mode bits for new file creation. */
+        uint16_t upgrade_mask;   /* Restrict how O_PATHs may be re-opened. */
+    };
+    uint32_t resolve;            /* RESOLVE_* path-resolution flags. */
+};
+.EE
+.in
+.PP
+Any future extensions to
+.BR openat2 ()
+will be implemented as new fields appended to the above structure, with the
+zero value of the new fields acting as though the extension were not present.
+.PP
+The meaning of each field is as follows:
+.RS
+
+.I flags
+.RS
+The file creation and status flags to use for this operation. All of the
+.B O_*
+flags defined for
+.BR openat (2)
+are valid
+.BR openat2 ()
+flag values.
+.RE
+
+.I upgrade_mask
+.RS
+Restrict the
+.I access modes
+with which the returned
+.B O_PATH
+descriptor may be re-opened (either through
+.B O_EMPTYPATH
+or
+.IR /proc/self/fd/ .)
+This field may only be set to a non-zero value if
+.I flags
+contains
+.BR O_PATH .
+By default, an
+.B O_PATH
+file descriptor of an ordinary file may be re-opened with any access mode (but an
+.B O_PATH
+file descriptor of a magic-link may only be re-opened with access modes that
+the original magic-link possessed). The full list of
+.I upgrade_mask
+flags is given below.
+.TP
+.B UPGRADE_NOREAD
+Do not permit the
+.B O_PATH
+file descriptor to be re-opened for reading (i.e.
+.BR O_RDONLY " or " O_RDWR .)
+.TP
+.B UPGRADE_NOWRITE
+Do not permit the
+.B O_PATH
+file descriptor to be re-opened for writing (i.e.
+.BR O_WRONLY ", " O_RDWR ", or " O_APPEND .)
+.RE
+
+.I resolve
+.RS
+Change how the components of
+.I pathname
+will be resolved (see
+.BR path_resolution (7)
+for background information.) The primary use-case for these flags is to allow
+trusted programs to restrict how un-trusted paths (or paths inside un-trusted
+directories) are resolved. The full list of
+.I resolve
+flags is given below.
+.TP
+.B RESOLVE_NO_XDEV
+Disallow all mount-point crossings during path resolution (including
+all bind-mounts).
+
+Users of this flag are encouraged to make its use configurable (unless it is
+used for a specific security purpose), as bind-mounts are very widely used by
+end-users and thus enabling this flag globally may result in spurious errors on
+some systems.
+.TP
+.B RESOLVE_NO_SYMLINKS
+Disallow all symlink resolution during path resolution. If the trailing
+component is a symlink, and
+.I flags
+contains both
+.BR O_PATH " and " O_NOFOLLOW ","
+then an
+.B O_PATH
+file descriptor referencing the symlink will be returned. This option implies
+.BR RESOLVE_NO_MAGICLINKS .
+
+Users of this flag are encouraged to make its use configurable (unless it is
+used for a specific security purpose), as symlinks are very widely used by
+end-users and thus enabling this flag globally may result in spurious errors on
+some systems.
+.TP
+.B RESOLVE_NO_MAGICLINKS
+Disallow all magic-link resolution during path resolution. If the trailing
+component is a magic-link, and
+.I flags
+contains both
+.BR O_PATH " and " O_NOFOLLOW ","
+then an
+.B O_PATH
+file descriptor referencing the magic-link will be returned.
+
+Magic-links are symlink-like objects that are most notably found in
+.BR proc (5)
+(examples include
+.IR /proc/[pid]/exe " and " /proc/[pid]/fd/* .)
+Due to the potential danger of unknowingly opening these magic-links, it may be
+preferable for users to disable their resolution entirely (see
+.BR symlink (7)
+for more details.)
+.TP
+.B RESOLVE_BENEATH
+Do not permit the path resolution to succeed if any component of the resolution
+is not a descendant of the directory indicated by
+.IR dirfd .
+This results in absolute symlinks (and absolute values of
+.IR pathname )
+being rejected. Magic-link resolution is also not permitted.
+
+.TP
+.B RESOLVE_IN_ROOT
+Temporarily treat
+.I dirfd
+as the root of the filesystem (as though the user called
+.BR chroot (2)
+with
+.IR dirfd
+as the argument.) Absolute symlinks and ".." path components will be scoped to \fIdirfd\fP.
+Magic-link resolution is also not permitted.
+
+However, unlike
+.BR chroot (2)
+(which changes the filesystem root persistently for an entire thread-group),
+.B RESOLVE_IN_ROOT
+allows a program to efficiently restrict path resolution for only certain
+operations. It also has several hardening features (such as not permitting
+magic-link resolution) which
+.BR chroot (2)
+does not.
+.RE
+
+.RE
+
+.PP
+Unlike
+.BR openat (2),
+any unknown flags set in fields of
+.I how
+will result in an error, rather than being ignored. In addition, an error will
+be returned if the value of the
+.IR mode " and " upgrade_mask
+union is non-zero unless:
+.RS
+.IP * 3
+.I flags
+indicates that a new file will be created (it contains
+.BR O_CREAT " or " O_TMPFILE ),
+in which case
+.I mode
+may be any valid file mode.
+.IP *
+.I flags
+contains
+.BR O_PATH ,
+in which case
+.I upgrade_mask
+must only contain valid
+.B UPGRADE_*
+flags.
+.RE
+
+.SH RETURN VALUE
+On success, a new file descriptor is returned. On error, -1 is returned, and
+.I errno
+is set appropriately.
+
+.SH ERRORS
+The set of errors returned by
+.BR openat2 ()
+includes all of the errors returned by
+.BR openat (2),
+as well as the following additional errors:
+.TP
+.B EINVAL
+An unknown flag or invalid value was specified in
+.IR how .
+.TP
+.B EINVAL
+.I size
+was smaller than any known version of
+.IR "struct open_how" .
+.TP
+.B E2BIG
+An extension was specified in
+.IR how ,
+which the current kernel does not support (see the "Extensibility" section of
+the \fBNOTES\fP for more detail on how extensions are handled.)
+.TP
+.B EAGAIN
+.I resolve
+contains either
+.BR RESOLVE_IN_ROOT " or " RESOLVE_BENEATH ,
+and the kernel could not ensure that a ".." component didn't escape (due to a
+race condition or potential attack). Callers may choose to retry the
+.BR openat2 ()
+call.
+.TP
+.B EXDEV
+.I resolve
+contains either
+.BR RESOLVE_IN_ROOT " or " RESOLVE_BENEATH ,
+and a path component attempted to escape the root of the resolution.
+
+.TP
+.B EXDEV
+.I resolve
+contains
+.BR RESOLVE_NO_XDEV ,
+and a path component attempted to cross a mount-point.
+
+.TP
+.B ELOOP
+.I resolve
+contains
+.BR RESOLVE_NO_SYMLINKS ,
+and one of the path components was a symlink.
+.TP
+.B ELOOP
+.I resolve
+contains
+.BR RESOLVE_NO_MAGICLINKS ,
+and one of the path components was a magic-link.
+
+.SH VERSIONS
+.BR openat2 ()
+was added to Linux in kernel 5.FOO.
+
+.SH CONFORMING TO
+This system call is Linux-specific.
+
+The semantics of
+.B RESOLVE_BENEATH
+were modelled after FreeBSD's
+.BR O_BENEATH .
+
+.SH NOTES
+Glibc does not provide a wrapper for this system call; call it using
+.BR syscall (2).
+
+.SS Extensibility
+In order to allow for
+.I struct open_how
+to be extended in future kernel revisions,
+.BR openat2 ()
+requires userspace to specify the size of the
+.I struct open_how
+structure it is passing. By providing this information, it is possible for
+.BR openat2 ()
+to provide both forwards- and backwards-compatibility \(em with
+.I size
+acting as an implicit version number (because new extension fields will always
+be appended, the size will always increase.) This extensibility design is very
+similar to other system calls such as
+.BR sched_setattr "(2), " perf_event_open "(2), and " clone3 (2).
+
+If we let
+.I usize
+be the size of the structure according to userspace and
+.I ksize
+be the size of the structure which the kernel supports, then there are only
+three cases to consider:
+
+.RS
+.IP * 3
+If
+.IR ksize " equals " usize ,
+then there is no version mismatch and
+.I how
+can be used verbatim.
+.IP *
+If
+.IR ksize " is larger than " usize ,
+then there are some extensions the kernel supports which the userspace program
+is unaware of. Because all extensions must have their zero values be a no-op,
+the kernel treats all of the extension fields not set by userspace to have zero
+values. This provides backwards-compatibility.
+.IP *
+If
+.IR ksize " is smaller than " usize ,
+then there are some extensions which the userspace program is aware of but the
+kernel does not support. Because all extensions must have their zero values be
+a no-op, the kernel can safely ignore the unsupported extension fields if they
+are all-zero. If any unsupported extension fields are non-zero, then an error
+is returned. This provides forwards-compatibility.
+.RE
+
+Therefore, most userspace programs will not need to have any special handling
+of extensions. However, if a userspace program wishes to determine which
+extensions the running kernel supports, it may conduct a binary search on
+.IR size
+(to find the largest value which doesn't produce an error.)
+
+.SH SEE ALSO
+.BR openat (2),
+.BR path_resolution (7),
+.BR symlink (7)
diff --git a/man7/path_resolution.7 b/man7/path_resolution.7
index 85dd354e9a93..3da3e5b614c8 100644
--- a/man7/path_resolution.7
+++ b/man7/path_resolution.7
@@ -29,17 +29,17 @@ path_resolution \- how a pathname is resolved to a file
 Some UNIX/Linux system calls have as parameter one or more filenames.
 A filename (or pathname) is resolved as follows.
 .SS Step 1: start of the resolution process
-If the pathname starts with the \(aq/\(aq character,
-the starting lookup directory
-is the root directory of the calling process.
-(A process inherits its
-root directory from its parent.
-Usually this will be the root directory
-of the file hierarchy.
-A process may get a different root directory
-by use of the
+If the pathname starts with the \(aq/\(aq character, the starting lookup
+directory is the root directory of the calling process. (A process inherits its
+root directory from its parent. Usually this will be the root directory of the
+file hierarchy. A process may get a different root directory by use of the
 .BR chroot (2)
-system call.
+system call, or may temporarily use a different root directory by using
+.BR openat2 (2)
+with the
+.B RESOLVE_IN_ROOT
+flag set.
+.PP
 A process may get an entirely private mount namespace in case
 it\(emor one of its ancestors\(emwas started by an invocation of the
 .BR clone (2)
@@ -48,16 +48,24 @@ system call that had the
 flag set.)
 This handles the \(aq/\(aq part of the pathname.
 .PP
-If the pathname does not start with the \(aq/\(aq character, the
-starting lookup directory of the resolution process is the current working
-directory of the process.
-(This is also inherited from the parent.
-It can be changed by use of the
+If the pathname does not start with the \(aq/\(aq character, the starting
+lookup directory of the resolution process is the current working directory of
+the process \(em or in the case of
+.BR openat (2)-style
+syscalls, the
+.I dfd
+argument (or the current working directory if
+.B AT_FDCWD
+is passed as the
+.I dfd
+argument). The current working directory is inherited from the parent, and can
+be changed by use of the
 .BR chdir (2)
-system call.)
+syscall.
 .PP
 Pathnames starting with a \(aq/\(aq character are called absolute pathnames.
 Pathnames not starting with a \(aq/\(aq are called relative pathnames.
+
 .SS Step 2: walk along the path
 Set the current lookup directory to the starting lookup directory.
 Now, for each nonfinal component of the pathname, where a component
@@ -124,6 +132,13 @@ the kernel's pathname-resolution code
 was reworked to eliminate the use of recursion,
 so that the only limit that remains is the maximum of 40
 resolutions for the entire pathname.
+.PP
+The resolution of symlinks during this stage can be blocked by using
+.BR openat2 (2),
+with the
+.B RESOLVE_NO_SYMLINKS
+flag set.
+
 .SS Step 3: find the final entry
 The lookup of the final component of the pathname goes just like
 that of all other components, as described in the previous step,
@@ -160,7 +175,8 @@ The path resolution process will assume that these entries have
 their conventional meanings, regardless of whether they are
 actually present in the physical filesystem.
 .PP
-One cannot walk down past the root: "/.." is the same as "/".
+One cannot walk up past the root: "/.." is the same as "/".
+
 .SS Mount points
 After a "mount dev path" command, the pathname "path" refers to
 the root of the filesystem hierarchy on the device "dev", and no
@@ -169,6 +185,13 @@ longer to whatever it referred to earlier.
 One can walk out of a mounted filesystem: "path/.." refers to
 the parent directory of "path",
 outside of the filesystem hierarchy on "dev".
+.PP
+Mount-point crossings can be blocked by using
+.BR openat2 (2),
+with the
+.B RESOLVE_NO_XDEV
+flag set (though note that this also restricts bind-mount crossings).
+
 .SS Trailing slashes
 If a pathname ends in a \(aq/\(aq, that forces resolution of the preceding
 component as in Step 2: it has to exist and resolve to a directory.
-- 
2.23.0



* [PATCH RFC 3/3] openat2.2: document new syscall
  2019-10-03 14:55 [PATCH RFC 0/3] document openat2(2) patch series Aleksa Sarai
                   ` (2 preceding siblings ...)
  2019-10-03 14:55 ` [PATCH RFC 3/3] openat2.2: document new openat2(2) syscall Aleksa Sarai
@ 2019-10-03 14:55 ` Aleksa Sarai
  2019-10-03 15:00   ` Aleksa Sarai
  3 siblings, 1 reply; 18+ messages in thread
From: Aleksa Sarai @ 2019-10-03 14:55 UTC (permalink / raw)
  To: Al Viro, Michael Kerrisk
  Cc: Aleksa Sarai, Christian Brauner, Aleksa Sarai, linux-man,
	linux-api, linux-kernel

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 man2/open.2            |   5 +
 man2/openat2.2         | 381 +++++++++++++++++++++++++++++++++++++++++
 man7/path_resolution.7 |  57 ++++--
 3 files changed, 426 insertions(+), 17 deletions(-)
 create mode 100644 man2/openat2.2

diff --git a/man2/open.2 b/man2/open.2
index 7217fe056e5e..a0b43394bbee 100644
--- a/man2/open.2
+++ b/man2/open.2
@@ -65,6 +65,10 @@ open, openat, creat \- open and possibly create a file
 .BI "int openat(int " dirfd ", const char *" pathname ", int " flags );
 .BI "int openat(int " dirfd ", const char *" pathname ", int " flags \
 ", mode_t " mode );
+.PP
+/* Documented separately, in \fBopenat2\fP(2). */
+.BI "int openat2(int " dirfd ", const char *" pathname ", \
+const struct open_how *" how ", size_t " size ");
 .fi
 .PP
 .in -4n
@@ -1808,6 +1812,7 @@ will create a regular file (i.e.,
 .B O_DIRECTORY
 is ignored).
 .SH SEE ALSO
+.BR openat2 (2),
 .BR chmod (2),
 .BR chown (2),
 .BR close (2),
diff --git a/man2/openat2.2 b/man2/openat2.2
new file mode 100644
index 000000000000..c43c76046243
--- /dev/null
+++ b/man2/openat2.2
@@ -0,0 +1,381 @@
+.\" Copyright (C) 2019 Aleksa Sarai <cyphar@cyphar.com>
+.\"
+.\" %%%LICENSE_START(VERBATIM)
+.\" Permission is granted to make and distribute verbatim copies of this
+.\" manual provided the copyright notice and this permission notice are
+.\" preserved on all copies.
+.\"
+.\" Permission is granted to copy and distribute modified versions of this
+.\" manual under the conditions for verbatim copying, provided that the
+.\" entire resulting derived work is distributed under the terms of a
+.\" permission notice identical to this one.
+.\"
+.\" Since the Linux kernel and libraries are constantly changing, this
+.\" manual page may be incorrect or out-of-date.  The author(s) assume no
+.\" responsibility for errors or omissions, or for damages resulting from
+.\" the use of the information contained herein.  The author(s) may not
+.\" have taken the same level of care in the production of this manual,
+.\" which is licensed free of charge, as they might when working
+.\" professionally.
+.\"
+.\" Formatted or processed versions of this manual, if unaccompanied by
+.\" the source, must acknowledge the copyright and authors of this work.
+.\" %%%LICENSE_END
+.TH OPENAT2 2 2019-10-03 "Linux" "Linux Programmer's Manual"
+.SH NAME
+openat2 \- open and possibly create a file (extended)
+.SH SYNOPSIS
+.nf
+.B #include <sys/types.h>
+.B #include <sys/stat.h>
+.B #include <fcntl.h>
+.PP
+.BI "int openat2(int " dirfd ", const char *" pathname ", \
+const struct open_how *" how ", size_t " size ");
+.fi
+.PP
+.IR Note :
+There is no glibc wrapper for this system call; see NOTES.
+.SH DESCRIPTION
+The
+.BR openat2 ()
+system call is an extension of
+.BR openat (2)
+and provides a superset of its functionality. Rather than taking a single
+.I flags
+argument, an extensible structure (\fIhow\fP) is passed instead to allow for
+seamless future extensions.
+.PP
+.I size
+must be set to
+.IR "sizeof(struct open_how)" ,
+to facilitate future extensions (see the "Extensibility" section of the
+\fBNOTES\fP for more detail on how extensions are handled.)
+
+.SS The open_how structure
+The following structure indicates how
+.I pathname
+should be opened, and acts as a superset of the
+.IR flags " and " mode
+arguments to
+.BR openat (2).
+.PP
+.in +4n
+.EX
+struct open_how {
+    uint32_t flags;              /* open(2)-style O_* flags. */
+    union {
+        uint16_t mode;           /* File mode bits for new file creation. */
+        uint16_t upgrade_mask;   /* Restrict how O_PATHs may be re-opened. */
+    };
+    uint32_t resolve;            /* RESOLVE_* path-resolution flags. */
+};
+.EE
+.in
+.PP
+Any future extensions to
+.BR openat2 ()
+will be implemented as new fields appended to the above structure, with the
+zero value of the new fields acting as though the extension were not present.
+.PP
+The meaning of each field is as follows:
+.RS
+
+.I flags
+.RS
+The file creation and status flags to use for this operation. All of the
+.B O_*
+flags defined for
+.BR openat (2)
+are valid
+.BR openat2 ()
+flag values.
+.RE
+
+.I upgrade_mask
+.RS
+Restrict the
+.I access modes
+with which the returned
+.B O_PATH
+descriptor may be re-opened (either through
+.B O_EMPTYPATH
+or
+.IR /proc/self/fd/ .)
+This field may only be set to a non-zero value if
+.I flags
+contains
+.BR O_PATH .
+By default, an
+.B O_PATH
+file descriptor of an ordinary file may be re-opened with any access mode (but an
+.B O_PATH
+file descriptor of a magic-link may only be re-opened with access modes that
+the original magic-link possessed). The full list of
+.I upgrade_mask
+flags is given below.
+.TP
+.B UPGRADE_NOREAD
+Do not permit the
+.B O_PATH
+file descriptor to be re-opened for reading (i.e.
+.BR O_RDONLY " or " O_RDWR .)
+.TP
+.B UPGRADE_NOWRITE
+Do not permit the
+.B O_PATH
+file descriptor to be re-opened for writing (i.e.
+.BR O_WRONLY ", " O_RDWR ", or " O_APPEND .)
+.RE
+
+.I resolve
+.RS
+Change how the components of
+.I pathname
+will be resolved (see
+.BR path_resolution (7)
+for background information.) The primary use-case for these flags is to allow
+trusted programs to restrict how un-trusted paths (or paths inside un-trusted
+directories) are resolved. The full list of
+.I resolve
+flags is given below.
+.TP
+.B RESOLVE_NO_XDEV
+Disallow all mount-point crossings during path resolution (including
+all bind-mounts).
+
+Users of this flag are encouraged to make its use configurable (unless it is
+used for a specific security purpose), as bind-mounts are very widely used by
+end-users and thus enabling this flag globally may result in spurious errors on
+some systems.
+.TP
+.B RESOLVE_NO_SYMLINKS
+Disallow all symlink resolution during path resolution. If the trailing
+component is a symlink, and
+.I flags
+contains both
+.BR O_PATH " and " O_NOFOLLOW ","
+then an
+.B O_PATH
+file descriptor referencing the symlink will be returned. This option implies
+.BR RESOLVE_NO_MAGICLINKS .
+
+Users of this flag are encouraged to make its use configurable (unless it is
+used for a specific security purpose), as symlinks are very widely used by
+end-users and thus enabling this flag globally may result in spurious errors on
+some systems.
+.TP
+.B RESOLVE_NO_MAGICLINKS
+Disallow all magic-link resolution during path resolution. If the trailing
+component is a magic-link, and
+.I flags
+contains both
+.BR O_PATH " and " O_NOFOLLOW ","
+then an
+.B O_PATH
+file descriptor referencing the magic-link will be returned.
+
+Magic-links are symlink-like objects that are most notably found in
+.BR proc (5)
+(examples include
+.IR /proc/[pid]/exe " and " /proc/[pid]/fd/* .)
+Due to the potential danger of unknowingly opening these magic-links, it may be
+preferable for users to disable their resolution entirely (see
+.BR symlink (7)
+for more details.)
+.TP
+.B RESOLVE_BENEATH
+Do not permit the path resolution to succeed if any component of the resolution
+is not a descendant of the directory indicated by
+.IR dirfd .
+This causes absolute symlinks (and absolute values of
+.IR pathname )
+to be rejected. Magic-link resolution is also not permitted.
+
+.TP
+.B RESOLVE_IN_ROOT
+Temporarily treat
+.I dirfd
+as the root of the filesystem (as though the user called
+.BR chroot (2)
+with
+.IR dirfd
+as the argument.) Absolute symlinks and ".." path components will be scoped to
+.IR dirfd ". Magic-link resolution is also not permitted."
+
+However, unlike
+.BR chroot (2)
+(which changes the filesystem root persistently for an entire thread-group),
+.B RESOLVE_IN_ROOT
+allows a program to efficiently restrict path resolution for only certain
+operations. It also has several hardening features (such as not permitting
+magic-link resolution) which
+.BR chroot (2)
+does not.
+.RE
+
+.RE
+
+.PP
+Unlike
+.BR openat (2),
+any unknown flags set in fields of
+.I how
+will result in an error, rather than being ignored. In addition, an error will
+be returned if the value of the
+.IR mode " and " upgrade_mask
+union is non-zero unless:
+.RS
+.IP * 3
+.I flags
+indicates that a new file will be created (it contains
+.BR O_CREAT " or " O_TMPFILE ),
+in which case
+.I mode
+may be any valid file mode.
+.IP *
+.I flags
+contains
+.BR O_PATH ,
+in which case
+.I upgrade_mask
+must only contain valid
+.B UPGRADE_*
+flags.
+.RE
+
+.SH RETURN VALUE
+On success, a new file descriptor is returned. On error, \-1 is returned, and
+.I errno
+is set appropriately.
+
+.SH ERRORS
+The set of errors returned by
+.BR openat2 ()
+includes all of the errors returned by
+.BR openat (2),
+as well as the following additional errors:
+.TP
+.B EINVAL
+An unknown flag or invalid value was specified in
+.IR how .
+.TP
+.B EINVAL
+.I size
+was smaller than any known version of
+.IR "struct open_how" .
+.TP
+.B E2BIG
+An extension was specified in
+.IR how ,
+which the current kernel does not support (see the "Extensibility" section of
+the \fBNOTES\fP for more detail on how extensions are handled.)
+.TP
+.B EAGAIN
+.I resolve
+contains either
+.BR RESOLVE_IN_ROOT " or " RESOLVE_BENEATH ,
+and the kernel could not ensure that a ".." component didn't escape (due to a
+race condition or potential attack). Callers may choose to retry the
+.BR openat2 ()
+call.
+.TP
+.B EXDEV
+.I resolve
+contains either
+.BR RESOLVE_IN_ROOT " or " RESOLVE_BENEATH ,
+and a path component attempted to escape the root of the resolution.
+
+.TP
+.B EXDEV
+.I resolve
+contains
+.BR RESOLVE_NO_XDEV ,
+and a path component attempted to cross a mount-point.
+
+.TP
+.B ELOOP
+.I resolve
+contains
+.BR RESOLVE_NO_SYMLINKS ,
+and one of the path components was a symlink.
+.TP
+.B ELOOP
+.I resolve
+contains
+.BR RESOLVE_NO_MAGICLINKS ,
+and one of the path components was a magic-link.
+
+.SH VERSIONS
+.BR openat2 ()
+was added to Linux in kernel 5.FOO.
+
+.SH CONFORMING TO
+This system call is Linux-specific.
+
+The semantics of
+.B RESOLVE_BENEATH
+were modelled after FreeBSD's
+.BR O_BENEATH .
+
+.SH NOTES
+Glibc does not provide a wrapper for this system call; call it using
+.BR syscall (2).
+
+.SS Extensibility
+In order to allow for
+.I struct open_how
+to be extended in future kernel revisions,
+.BR openat2 ()
+requires userspace to specify the size of the
+.I struct open_how
+structure it is passing. By providing this information, it is possible for
+.BR openat2 ()
+to provide both forwards- and backwards-compatibility \(em with
+.I size
+acting as an implicit version number (because new extension fields will always
+be appended, the size will always increase.) This extensibility design is very
+similar to other system calls such as
+.BR sched_setattr "(2), " perf_event_open "(2), and " clone3 (2).
+
+If we let
+.I usize
+be the size of the structure according to userspace and
+.I ksize
+be the size of the structure which the kernel supports, then there are only
+three cases to consider:
+
+.RS
+.IP * 3
+If
+.IR ksize " equals " usize ,
+then there is no version mismatch and
+.I how
+can be used verbatim.
+.IP *
+If
+.IR ksize " is larger than " usize ,
+then there are some extensions the kernel supports which the userspace program
+is unaware of. Because all extensions must have their zero values be a no-op,
+the kernel treats all of the extension fields not set by userspace as having zero
+values. This provides backwards-compatibility.
+.IP *
+If
+.IR ksize " is smaller than " usize ,
+then there are some extensions which the userspace program is aware of but the
+kernel does not support. Because all extensions must have their zero values be
+a no-op, the kernel can safely ignore the unsupported extension fields if they
+are all-zero. If any unsupported extension fields are non-zero, then an error
+is returned. This provides forwards-compatibility.
+.RE
+
+Therefore, most userspace programs will not need to have any special handling
+of extensions. However, if a userspace program wishes to determine what
+extensions the running kernel supports, it may conduct a binary search on
+.IR size
+(to find the largest value which doesn't produce an error.)
+
+.SH SEE ALSO
+.BR openat (2),
+.BR path_resolution (7),
+.BR symlink (7)
diff --git a/man7/path_resolution.7 b/man7/path_resolution.7
index 85dd354e9a93..3da3e5b614c8 100644
--- a/man7/path_resolution.7
+++ b/man7/path_resolution.7
@@ -29,17 +29,17 @@ path_resolution \- how a pathname is resolved to a file
 Some UNIX/Linux system calls have as parameter one or more filenames.
 A filename (or pathname) is resolved as follows.
 .SS Step 1: start of the resolution process
-If the pathname starts with the \(aq/\(aq character,
-the starting lookup directory
-is the root directory of the calling process.
-(A process inherits its
-root directory from its parent.
-Usually this will be the root directory
-of the file hierarchy.
-A process may get a different root directory
-by use of the
+If the pathname starts with the \(aq/\(aq character, the starting lookup
+directory is the root directory of the calling process. (A process inherits its
+root directory from its parent. Usually this will be the root directory of the
+file hierarchy. A process may get a different root directory by use of the
 .BR chroot (2)
-system call.
+system call, or may temporarily use a different root directory by using
+.BR openat2 (2)
+with the
+.B RESOLVE_IN_ROOT
+flag set.
+.PP
 A process may get an entirely private mount namespace in case
 it\(emor one of its ancestors\(emwas started by an invocation of the
 .BR clone (2)
@@ -48,16 +48,24 @@ system call that had the
 flag set.)
 This handles the \(aq/\(aq part of the pathname.
 .PP
-If the pathname does not start with the \(aq/\(aq character, the
-starting lookup directory of the resolution process is the current working
-directory of the process.
-(This is also inherited from the parent.
-It can be changed by use of the
+If the pathname does not start with the \(aq/\(aq character, the starting
+lookup directory of the resolution process is the current working directory of
+the process \(em or in the case of
+.BR openat (2)-style
+syscalls, the
+.I dirfd
+argument (or the current working directory if
+.B AT_FDCWD
+is passed as the
+.I dirfd
+argument). The current working directory is inherited from the parent, and can
+be changed by use of the
 .BR chdir (2)
-system call.)
+syscall.
 .PP
 Pathnames starting with a \(aq/\(aq character are called absolute pathnames.
 Pathnames not starting with a \(aq/\(aq are called relative pathnames.
+
 .SS Step 2: walk along the path
 Set the current lookup directory to the starting lookup directory.
 Now, for each nonfinal component of the pathname, where a component
@@ -124,6 +132,13 @@ the kernel's pathname-resolution code
 was reworked to eliminate the use of recursion,
 so that the only limit that remains is the maximum of 40
 resolutions for the entire pathname.
+.PP
+The resolution of symlinks during this stage can be blocked by using
+.BR openat2 (2),
+with the
+.B RESOLVE_NO_SYMLINKS
+flag set.
+
 .SS Step 3: find the final entry
 The lookup of the final component of the pathname goes just like
 that of all other components, as described in the previous step,
@@ -160,7 +175,8 @@ The path resolution process will assume that these entries have
 their conventional meanings, regardless of whether they are
 actually present in the physical filesystem.
 .PP
-One cannot walk down past the root: "/.." is the same as "/".
+One cannot walk up past the root: "/.." is the same as "/".
+
 .SS Mount points
 After a "mount dev path" command, the pathname "path" refers to
 the root of the filesystem hierarchy on the device "dev", and no
@@ -169,6 +185,13 @@ longer to whatever it referred to earlier.
 One can walk out of a mounted filesystem: "path/.." refers to
 the parent directory of "path",
 outside of the filesystem hierarchy on "dev".
+.PP
+Mount-point crossings can be blocked by using
+.BR openat2 (2),
+with the
+.B RESOLVE_NO_XDEV
+flag set (though note that this also restricts bind-mount crossings).
+
 .SS Trailing slashes
 If a pathname ends in a \(aq/\(aq, that forces resolution of the preceding
 component as in Step 2: it has to exist and resolve to a directory.
-- 
2.23.0

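As an illustration of the size handling described in the Extensibility
notes above, here is a minimal userspace sketch of the three usize/ksize
cases. Everything in it is an assumption made for the example: the names
model_how_v0, model_how_v1 and model_copy_how are hypothetical, the two
structures are simplified stand-ins rather than the real struct open_how,
and the function only imitates the documented behaviour. It is not the
kernel's implementation.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for two revisions of an extensible structure. */
struct model_how_v0 {
    uint32_t flags;
    uint32_t resolve;
};

struct model_how_v1 {
    uint32_t flags;
    uint32_t resolve;
    uint32_t extension;   /* appended later; zero means "not used" */
};

/*
 * Copy a usize-byte caller structure into a ksize-byte "kernel" structure.
 * min_size models the smallest structure version ever published.
 * Returns 0 on success or a negative errno value.
 */
int model_copy_how(void *kdst, size_t ksize,
                   const void *usrc, size_t usize, size_t min_size)
{
    const unsigned char *in = usrc;
    size_t common = usize < ksize ? usize : ksize;
    size_t i;

    if (usize < min_size)
        return -EINVAL;

    /* usize > ksize: fields unknown to the "kernel" must be all-zero. */
    for (i = ksize; i < usize; i++) {
        if (in[i] != 0)
            return -E2BIG;
    }

    /* usize < ksize: fields the caller did not supply read as zero. */
    memset(kdst, 0, ksize);
    memcpy(kdst, in, common);
    return 0;
}
```

With this model, a caller built against the older layout is accepted
unchanged, and a caller built against the newer layout only gets E2BIG
once it sets a field the receiving side does not know about.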

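The rule for the mode/upgrade_mask union in the DESCRIPTION can be
sketched the same way. This is only a model of the documented check: the
function name model_check_mode_union is hypothetical, and the numeric
values given to the UPGRADE_* flags are assumptions made for the example,
since the patch does not state them.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>

/* Assumed values; the man page does not give the real constants. */
#define MODEL_UPGRADE_NOREAD  0x01
#define MODEL_UPGRADE_NOWRITE 0x02
#define MODEL_UPGRADE_ALL     (MODEL_UPGRADE_NOREAD | MODEL_UPGRADE_NOWRITE)

/*
 * Validate the 16-bit union shared by mode and upgrade_mask.
 * Returns 0 if the value is acceptable for the given flags,
 * or -EINVAL otherwise.
 */
int model_check_mode_union(uint32_t flags, uint16_t value)
{
    /* A zero value is always accepted. */
    if (value == 0)
        return 0;

    /* File creation: the union is a mode, limited to valid mode bits. */
    if ((flags & O_CREAT) || (flags & O_TMPFILE) == O_TMPFILE)
        return (value & ~07777) ? -EINVAL : 0;

    /* O_PATH: the union is an upgrade_mask, limited to known flags. */
    if (flags & O_PATH)
        return (value & ~MODEL_UPGRADE_ALL) ? -EINVAL : 0;

    /* Any other combination with a non-zero value is rejected. */
    return -EINVAL;
}
```

For instance, a non-zero mode passed together with a plain O_RDONLY is
rejected here, whereas openat(2) would silently ignore it.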

* Re: [PATCH RFC 3/3] openat2.2: document new syscall
  2019-10-03 14:55 ` [PATCH RFC 3/3] openat2.2: document new syscall Aleksa Sarai
@ 2019-10-03 15:00   ` Aleksa Sarai
  0 siblings, 0 replies; 18+ messages in thread
From: Aleksa Sarai @ 2019-10-03 15:00 UTC (permalink / raw)
  To: Al Viro, Michael Kerrisk
  Cc: Christian Brauner, Aleksa Sarai, linux-man, linux-api, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 18111 bytes --]

Ignore this one (it's an older version of the openat2.2 patch) -- I sent
it by accident.

On 2019-10-04, Aleksa Sarai <cyphar@cyphar.com> wrote:
> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
> ---
>  man2/open.2            |   5 +
>  man2/openat2.2         | 381 +++++++++++++++++++++++++++++++++++++++++
>  man7/path_resolution.7 |  57 ++++--
>  3 files changed, 426 insertions(+), 17 deletions(-)
>  create mode 100644 man2/openat2.2
> 
> diff --git a/man2/open.2 b/man2/open.2
> index 7217fe056e5e..a0b43394bbee 100644
> --- a/man2/open.2
> +++ b/man2/open.2
> @@ -65,6 +65,10 @@ open, openat, creat \- open and possibly create a file
>  .BI "int openat(int " dirfd ", const char *" pathname ", int " flags );
>  .BI "int openat(int " dirfd ", const char *" pathname ", int " flags \
>  ", mode_t " mode );
> +.PP
> +/* Documented separately, in \fBopenat2\fP(2). */
> +.BI "int openat2(int " dirfd ", const char *" pathname ", \
> +const struct open_how *" how ", size_t " size ");
>  .fi
>  .PP
>  .in -4n
> @@ -1808,6 +1812,7 @@ will create a regular file (i.e.,
>  .B O_DIRECTORY
>  is ignored).
>  .SH SEE ALSO
> +.BR openat2 (2),
>  .BR chmod (2),
>  .BR chown (2),
>  .BR close (2),
> diff --git a/man2/openat2.2 b/man2/openat2.2
> new file mode 100644
> index 000000000000..c43c76046243
> --- /dev/null
> +++ b/man2/openat2.2
> @@ -0,0 +1,381 @@
> +.\" Copyright (C) 2019 Aleksa Sarai <cyphar@cyphar.com>
> +.\"
> +.\" %%%LICENSE_START(VERBATIM)
> +.\" Permission is granted to make and distribute verbatim copies of this
> +.\" manual provided the copyright notice and this permission notice are
> +.\" preserved on all copies.
> +.\"
> +.\" Permission is granted to copy and distribute modified versions of this
> +.\" manual under the conditions for verbatim copying, provided that the
> +.\" entire resulting derived work is distributed under the terms of a
> +.\" permission notice identical to this one.
> +.\"
> +.\" Since the Linux kernel and libraries are constantly changing, this
> +.\" manual page may be incorrect or out-of-date.  The author(s) assume no
> +.\" responsibility for errors or omissions, or for damages resulting from
> +.\" the use of the information contained herein.  The author(s) may not
> +.\" have taken the same level of care in the production of this manual,
> +.\" which is licensed free of charge, as they might when working
> +.\" professionally.
> +.\"
> +.\" Formatted or processed versions of this manual, if unaccompanied by
> +.\" the source, must acknowledge the copyright and authors of this work.
> +.\" %%%LICENSE_END
> +.TH OPENAT2 2 2019-10-03 "Linux" "Linux Programmer's Manual"
> +.SH NAME
> +openat2 \- open and possibly create a file (extended)
> +.SH SYNOPSIS
> +.nf
> +.B #include <sys/types.h>
> +.B #include <sys/stat.h>
> +.B #include <fcntl.h>
> +.PP
> +.BI "int openat2(int " dirfd ", const char *" pathname ", \
> +const struct open_how *" how ", size_t " size ");
> +.fi
> +.PP
> +.IR Note :
> +There is no glibc wrapper for this system call; see NOTES.
> +.SH DESCRIPTION
> +The
> +.BR openat2 ()
> +system call is an extension of
> +.BR openat (2)
> +and provides a superset of its functionality. Rather than taking a single
> +.I flags
> +argument, an extensible structure (\fIhow\fP) is passed instead to allow for
> +seamless future extensions.
> +.PP
> +.I size
> +must be set to
> +.IR "sizeof(struct open_how)" ,
> +to facilitate future extensions (see the "Extensibility" section of the
> +\fBNOTES\fP for more detail on how extensions are handled.)
> +
> +.SS The open_how structure
> +The following structure indicates how
> +.I pathname
> +should be opened, and acts as a superset of the
> +.IR flags " and " mode
> +arguments to
> +.BR openat (2).
> +.PP
> +.in +4n
> +.EX
> +struct open_how {
> +    uint32_t flags;              /* open(2)-style O_* flags. */
> +    union {
> +        uint16_t mode;           /* File mode bits for new file creation. */
> +        uint16_t upgrade_mask;   /* Restrict how O_PATHs may be re-opened. */
> +    };
> +    uint32_t resolve;            /* RESOLVE_* path-resolution flags. */
> +};
> +.EE
> +.in
> +.PP
> +Any future extensions to
> +.BR openat2 ()
> +will be implemented as new fields appended to the above structure, with the
> +zero value of the new fields acting as though the extension were not present.
> +.PP
> +The meaning of each field is as follows:
> +.RS
> +
> +.I flags
> +.RS
> +The file creation and status flags to use for this operation. All of the
> +.B O_*
> +flags defined for
> +.BR openat (2)
> +are valid
> +.BR openat2 ()
> +flag values.
> +.RE
> +
> +.I upgrade_mask
> +.RS
> +Restrict the
> +.I access modes
> +with which the returned
> +.B O_PATH
> +descriptor may be re-opened (either through
> +.B O_EMPTYPATH
> +or
> +.IR /proc/self/fd/ .)
> +This field may only be set to a non-zero value if
> +.I flags
> +contains
> +.BR O_PATH .
> +By default, an
> +.B O_PATH
> +file descriptor of an ordinary file may be re-opened with any access mode (but an
> +.B O_PATH
> +file descriptor of a magic-link may only be re-opened with access modes that
> +the original magic-link possessed). The full list of
> +.I upgrade_mask
> +flags is given below.
> +.TP
> +.B UPGRADE_NOREAD
> +Do not permit the
> +.B O_PATH
> +file descriptor to be re-opened for reading (i.e.
> +.BR O_RDONLY " or " O_RDWR .)
> +.TP
> +.B UPGRADE_NOWRITE
> +Do not permit the
> +.B O_PATH
> +file descriptor to be re-opened for writing (i.e.
> +.BR O_WRONLY ", " O_RDWR ", or " O_APPEND .)
> +.RE
> +
> +.I resolve
> +.RS
> +Change how the components of
> +.I pathname
> +will be resolved (see
> +.BR path_resolution (7)
> +for background information.) The primary use-case for these flags is to allow
> +trusted programs to restrict how un-trusted paths (or paths inside un-trusted
> +directories) are resolved. The full list of
> +.I resolve
> +flags is given below.
> +.TP
> +.B RESOLVE_NO_XDEV
> +Disallow all mount-point crossings during path resolution (including
> +all bind-mounts).
> +
> +Users of this flag are encouraged to make its use configurable (unless it is
> +used for a specific security purpose), as bind-mounts are very widely used by
> +end-users and thus enabling this flag globally may result in spurious errors on
> +some systems.
> +.TP
> +.B RESOLVE_NO_SYMLINKS
> +Disallow all symlink resolution during path resolution. If the trailing
> +component is a symlink, and
> +.I flags
> +contains both
> +.BR O_PATH " and " O_NOFOLLOW ","
> +then an
> +.B O_PATH
> +file descriptor referencing the symlink will be returned. This option implies
> +.BR RESOLVE_NO_MAGICLINKS .
> +
> +Users of this flag are encouraged to make its use configurable (unless it is
> +used for a specific security purpose), as symlinks are very widely used by
> +end-users and thus enabling this flag globally may result in spurious errors on
> +some systems.
> +.TP
> +.B RESOLVE_NO_MAGICLINKS
> +Disallow all magic-link resolution during path resolution. If the trailing
> +component is a magic-link, and
> +.I flags
> +contains both
> +.BR O_PATH " and " O_NOFOLLOW ","
> +then an
> +.B O_PATH
> +file descriptor referencing the magic-link will be returned.
> +
> +Magic-links are symlink-like objects that are most notably found in
> +.BR proc (5)
> +(examples include
> +.IR /proc/[pid]/exe " and " /proc/[pid]/fd/* .)
> +Due to the potential danger of unknowingly opening these magic-links, it may be
> +preferable for users to disable their resolution entirely (see
> +.BR symlink (7)
> +for more details.)
> +.TP
> +.B RESOLVE_BENEATH
> +Do not permit the path resolution to succeed if any component of the resolution
> +is not a descendant of the directory indicated by
> +.IR dirfd .
> +This causes absolute symlinks (and absolute values of
> +.IR pathname )
> +to be rejected. Magic-link resolution is also not permitted.
> +
> +.TP
> +.B RESOLVE_IN_ROOT
> +Temporarily treat
> +.I dirfd
> +as the root of the filesystem (as though the user called
> +.BR chroot (2)
> +with
> +.IR dirfd
> +as the argument.) Absolute symlinks and ".." path components will be scoped to
> +.IR dirfd ". Magic-link resolution is also not permitted."
> +
> +However, unlike
> +.BR chroot (2)
> +(which changes the filesystem root persistently for an entire thread-group),
> +.B RESOLVE_IN_ROOT
> +allows a program to efficiently restrict path resolution for only certain
> +operations. It also has several hardening features (such as not permitting
> +magic-link resolution) which
> +.BR chroot (2)
> +does not.
> +.RE
> +
> +.RE
> +
> +.PP
> +Unlike
> +.BR openat (2),
> +any unknown flags set in fields of
> +.I how
> +will result in an error, rather than being ignored. In addition, an error will
> +be returned if the value of the
> +.IR mode " and " upgrade_mask
> +union is non-zero unless:
> +.RS
> +.IP * 3
> +.I flags
> +indicates that a new file will be created (it contains
> +.BR O_CREAT " or " O_TMPFILE ),
> +in which case
> +.I mode
> +may be any valid file mode.
> +.IP *
> +.I flags
> +contains
> +.BR O_PATH ,
> +in which case
> +.I upgrade_mask
> +must only contain valid
> +.B UPGRADE_*
> +flags.
> +.RE
> +
> +.SH RETURN VALUE
> +On success, a new file descriptor is returned. On error, \-1 is returned, and
> +.I errno
> +is set appropriately.
> +
> +.SH ERRORS
> +The set of errors returned by
> +.BR openat2 ()
> +includes all of the errors returned by
> +.BR openat (2),
> +as well as the following additional errors:
> +.TP
> +.B EINVAL
> +An unknown flag or invalid value was specified in
> +.IR how .
> +.TP
> +.B EINVAL
> +.I size
> +was smaller than any known version of
> +.IR "struct open_how" .
> +.TP
> +.B E2BIG
> +An extension was specified in
> +.IR how ,
> +which the current kernel does not support (see the "Extensibility" section of
> +the \fBNOTES\fP for more detail on how extensions are handled.)
> +.TP
> +.B EAGAIN
> +.I resolve
> +contains either
> +.BR RESOLVE_IN_ROOT " or " RESOLVE_BENEATH ,
> +and the kernel could not ensure that a ".." component didn't escape (due to a
> +race condition or potential attack). Callers may choose to retry the
> +.BR openat2 ()
> +call.
> +.TP
> +.B EXDEV
> +.I resolve
> +contains either
> +.BR RESOLVE_IN_ROOT " or " RESOLVE_BENEATH ,
> +and a path component attempted to escape the root of the resolution.
> +
> +.TP
> +.B EXDEV
> +.I resolve
> +contains
> +.BR RESOLVE_NO_XDEV ,
> +and a path component attempted to cross a mount-point.
> +
> +.TP
> +.B ELOOP
> +.I resolve
> +contains
> +.BR RESOLVE_NO_SYMLINKS ,
> +and one of the path components was a symlink.
> +.TP
> +.B ELOOP
> +.I resolve
> +contains
> +.BR RESOLVE_NO_MAGICLINKS ,
> +and one of the path components was a magic-link.
> +
> +.SH VERSIONS
> +.BR openat2 ()
> +was added to Linux in kernel 5.FOO.
> +
> +.SH CONFORMING TO
> +This system call is Linux-specific.
> +
> +The semantics of
> +.B RESOLVE_BENEATH
> +were modelled after FreeBSD's
> +.BR O_BENEATH .
> +
> +.SH NOTES
> +Glibc does not provide a wrapper for this system call; call it using
> +.BR syscall (2).
> +
> +.SS Extensibility
> +In order to allow for
> +.I struct open_how
> +to be extended in future kernel revisions,
> +.BR openat2 ()
> +requires userspace to specify the size of the
> +.I struct open_how
> +structure it is passing. By providing this information, it is possible for
> +.BR openat2 ()
> +to provide both forwards- and backwards-compatibility \(em with
> +.I size
> +acting as an implicit version number (because new extension fields will always
> +be appended, the size will always increase.) This extensibility design is very
> +similar to other system calls such as
> +.BR sched_setattr "(2), " perf_event_open "(2), and " clone3 (2).
> +
> +If we let
> +.I usize
> +be the size of the structure according to userspace and
> +.I ksize
> +be the size of the structure which the kernel supports, then there are only
> +three cases to consider:
> +
> +.RS
> +.IP * 3
> +If
> +.IR ksize " equals " usize ,
> +then there is no version mismatch and
> +.I how
> +can be used verbatim.
> +.IP *
> +If
> +.IR ksize " is larger than " usize ,
> +then there are some extensions the kernel supports which the userspace program
> +is unaware of. Because all extensions must have their zero values be a no-op,
> +the kernel treats all of the extension fields not set by userspace as having zero
> +values. This provides backwards-compatibility.
> +.IP *
> +If
> +.IR ksize " is smaller than " usize ,
> +then there are some extensions which the userspace program is aware of but the
> +kernel does not support. Because all extensions must have their zero values be
> +a no-op, the kernel can safely ignore the unsupported extension fields if they
> +are all-zero. If any unsupported extension fields are non-zero, then an error
> +is returned. This provides forwards-compatibility.
> +.RE
> +
> +Therefore, most userspace programs will not need to have any special handling
> +of extensions. However, if a userspace program wishes to determine what
> +extensions the running kernel supports, it may conduct a binary search on
> +.IR size
> +(to find the largest value which doesn't produce an error.)
> +
> +.SH SEE ALSO
> +.BR openat (2),
> +.BR path_resolution (7),
> +.BR symlink (7)
> diff --git a/man7/path_resolution.7 b/man7/path_resolution.7
> index 85dd354e9a93..3da3e5b614c8 100644
> --- a/man7/path_resolution.7
> +++ b/man7/path_resolution.7
> @@ -29,17 +29,17 @@ path_resolution \- how a pathname is resolved to a file
>  Some UNIX/Linux system calls have as parameter one or more filenames.
>  A filename (or pathname) is resolved as follows.
>  .SS Step 1: start of the resolution process
> -If the pathname starts with the \(aq/\(aq character,
> -the starting lookup directory
> -is the root directory of the calling process.
> -(A process inherits its
> -root directory from its parent.
> -Usually this will be the root directory
> -of the file hierarchy.
> -A process may get a different root directory
> -by use of the
> +If the pathname starts with the \(aq/\(aq character, the starting lookup
> +directory is the root directory of the calling process. (A process inherits its
> +root directory from its parent. Usually this will be the root directory of the
> +file hierarchy. A process may get a different root directory by use of the
>  .BR chroot (2)
> -system call.
> +system call, or may temporarily use a different root directory by using
> +.BR openat2 (2)
> +with the
> +.B RESOLVE_IN_ROOT
> +flag set.
> +.PP
>  A process may get an entirely private mount namespace in case
>  it\(emor one of its ancestors\(emwas started by an invocation of the
>  .BR clone (2)
> @@ -48,16 +48,24 @@ system call that had the
>  flag set.)
>  This handles the \(aq/\(aq part of the pathname.
>  .PP
> -If the pathname does not start with the \(aq/\(aq character, the
> -starting lookup directory of the resolution process is the current working
> -directory of the process.
> -(This is also inherited from the parent.
> -It can be changed by use of the
> +If the pathname does not start with the \(aq/\(aq character, the starting
> +lookup directory of the resolution process is the current working directory of
> +the process \(em or in the case of
> +.BR openat (2)-style
> +syscalls, the
> +.I dirfd
> +argument (or the current working directory if
> +.B AT_FDCWD
> +is passed as the
> +.I dirfd
> +argument). The current working directory is inherited from the parent, and can
> +be changed by use of the
>  .BR chdir (2)
> -system call.)
> +syscall.
>  .PP
>  Pathnames starting with a \(aq/\(aq character are called absolute pathnames.
>  Pathnames not starting with a \(aq/\(aq are called relative pathnames.
> +
>  .SS Step 2: walk along the path
>  Set the current lookup directory to the starting lookup directory.
>  Now, for each nonfinal component of the pathname, where a component
> @@ -124,6 +132,13 @@ the kernel's pathname-resolution code
>  was reworked to eliminate the use of recursion,
>  so that the only limit that remains is the maximum of 40
>  resolutions for the entire pathname.
> +.PP
> +The resolution of symlinks during this stage can be blocked by using
> +.BR openat2 (2),
> +with the
> +.B RESOLVE_NO_SYMLINKS
> +flag set.
> +
>  .SS Step 3: find the final entry
>  The lookup of the final component of the pathname goes just like
>  that of all other components, as described in the previous step,
> @@ -160,7 +175,8 @@ The path resolution process will assume that these entries have
>  their conventional meanings, regardless of whether they are
>  actually present in the physical filesystem.
>  .PP
> -One cannot walk down past the root: "/.." is the same as "/".
> +One cannot walk up past the root: "/.." is the same as "/".
> +
>  .SS Mount points
>  After a "mount dev path" command, the pathname "path" refers to
>  the root of the filesystem hierarchy on the device "dev", and no
> @@ -169,6 +185,13 @@ longer to whatever it referred to earlier.
>  One can walk out of a mounted filesystem: "path/.." refers to
>  the parent directory of "path",
>  outside of the filesystem hierarchy on "dev".
> +.PP
> +Mount-point crossings can be blocked by using
> +.BR openat2 (2),
> +with the
> +.B RESOLVE_NO_XDEV
> +flag set (though note that this also restricts bind-mount crossings).
> +
>  .SS Trailing slashes
>  If a pathname ends in a \(aq/\(aq, that forces resolution of the preceding
>  component as in Step 2: it has to exist and resolve to a directory.
> -- 
> 2.23.0
> 


-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>



* Re: [PATCH RFC 1/3] symlink.7: document magic-links more completely
  2019-10-03 14:55 ` [PATCH RFC 1/3] symlink.7: document magic-links more completely Aleksa Sarai
@ 2019-10-07 16:36   ` Jann Horn
  2019-10-08  1:33     ` Aleksa Sarai
  2019-10-09  7:55   ` Michael Kerrisk (man-pages)
  2020-04-17 11:39   ` Michael Kerrisk (man-pages)
  2 siblings, 1 reply; 18+ messages in thread
From: Jann Horn @ 2019-10-07 16:36 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Al Viro, Michael Kerrisk, Christian Brauner, Aleksa Sarai,
	linux-man, Linux API, kernel list

On Thu, Oct 3, 2019 at 4:56 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
> Traditionally, magic-links have not been a well-understood topic in
> Linux. Given the new changes in their semantics (related to the link
> mode of trailing magic-links), it seems like a good opportunity to shine
> more light on magic-links and their semantics.
[...]
> +++ b/man7/symlink.7
> @@ -84,6 +84,25 @@ as they are implemented on Linux and other systems,
>  are outlined here.
>  It is important that site-local applications also conform to these rules,
>  so that the user interface can be as consistent as possible.
> +.SS Magic-links
> +There is a special class of symlink-like objects known as "magic-links" which

I think names like that normally aren't hyphenated in English, and
instead of "magic-links", it'd be "magic links"? Just like how you
wouldn't write "symbolic-link", but "symbolic link". But this is
bikeshedding, and if you disagree, feel free to ignore this comment.

> +can be found in certain pseudo-filesystems such as
> +.BR proc (5)
> +(examples include
> +.IR /proc/[pid]/exe " and " /proc/[pid]/fd/* .)
> +Unlike normal symlinks, magic-links are not resolved through

nit: AFAICS symlinks are always referred to as "symbolic links"
throughout the manpages.

> +pathname-expansion, but instead act as direct references to the kernel's own
> +representation of a file handle. As such, these magic-links allow users to
> +access files which cannot be referenced with normal paths (such as unlinked
> +files still referenced by a running program.)

Could maybe add "and files in different mount namespaces" as another
example here; at least for me, that's the main use case for
/proc/*/root.

[...]
> +However, magic-links do not follow this rule. They can have a non-0777 mode,
> +which is used for permission checks when the final
> +component of an
> +.BR open (2)'s

Maybe leave out the "open" part, since the same restriction has to
also apply to other syscalls operating on files, like truncate() and
so on?

> +path is a magic-link (see
> +.BR path_resolution (7).)


* Re: [PATCH RFC 1/3] symlink.7: document magic-links more completely
  2019-10-07 16:36   ` Jann Horn
@ 2019-10-08  1:33     ` Aleksa Sarai
  0 siblings, 0 replies; 18+ messages in thread
From: Aleksa Sarai @ 2019-10-08  1:33 UTC (permalink / raw)
  To: Jann Horn
  Cc: Al Viro, Michael Kerrisk, Christian Brauner, Aleksa Sarai,
	linux-man, Linux API, kernel list


On 2019-10-07, Jann Horn <jannh@google.com> wrote:
> On Thu, Oct 3, 2019 at 4:56 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
> > Traditionally, magic-links have not been a well-understood topic in
> > Linux. Given the new changes in their semantics (related to the link
> > mode of trailing magic-links), it seems like a good opportunity to shine
> > more light on magic-links and their semantics.
> [...]
> > +++ b/man7/symlink.7
> > @@ -84,6 +84,25 @@ as they are implemented on Linux and other systems,
> >  are outlined here.
> >  It is important that site-local applications also conform to these rules,
> >  so that the user interface can be as consistent as possible.
> > +.SS Magic-links
> > +There is a special class of symlink-like objects known as "magic-links" which
> 
> I think names like that normally aren't hyphenated in English, and
> instead of "magic-links", it'd be "magic links"? Just like how you
> wouldn't write "symbolic-link", but "symbolic link". But this is
> bikeshedding, and if you disagree, feel free to ignore this comment.

Looking at it now, I think you're right -- I hyphenated it here because
that's how I wrote it when documenting the feature in comments. But I
think that's because "symlink" and "magic-link" (the "abbreviated"
versions) seem to match better than "symlink" and "magic link".

I'll use "magic link" in documentation, but "magic-link" for all cases
where I would normally write "symlink".

> > +can be found in certain pseudo-filesystems such as
> > +.BR proc (5)
> > +(examples include
> > +.IR /proc/[pid]/exe " and " /proc/[pid]/fd/* .)
> > +Unlike normal symlinks, magic-links are not resolved through
> 
> nit: AFAICS symlinks are always referred to as "symbolic links"
> throughout the manpages.

:+1:

> > +pathname-expansion, but instead act as direct references to the kernel's own
> > +representation of a file handle. As such, these magic-links allow users to
> > +access files which cannot be referenced with normal paths (such as unlinked
> > +files still referenced by a running program.)
> 
> Could maybe add "and files in different mount namespaces" as another
> example here; at least for me, that's the main use case for
> /proc/*/root.

Will do.

> [...]
> > +However, magic-links do not follow this rule. They can have a non-0777 mode,
> > +which is used for permission checks when the final
> > +component of an
> > +.BR open (2)'s
> 
> Maybe leave out the "open" part, since the same restriction has to
> also apply to other syscalls operating on files, like truncate() and
> so on?

Yes (though I've just realised I hadn't implemented that -- oops.) Given
how expansive this patchset will get -- I might end up splitting it into
the magic-link stuff (and O_EMPTYPATH) and a separate series for
openat2(2) and the path resolution restrictions.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>


* Re: [PATCH RFC 1/3] symlink.7: document magic-links more completely
  2019-10-03 14:55 ` [PATCH RFC 1/3] symlink.7: document magic-links more completely Aleksa Sarai
  2019-10-07 16:36   ` Jann Horn
@ 2019-10-09  7:55   ` Michael Kerrisk (man-pages)
  2019-10-09  9:57     ` Aleksa Sarai
  2020-04-17 11:39   ` Michael Kerrisk (man-pages)
  2 siblings, 1 reply; 18+ messages in thread
From: Michael Kerrisk (man-pages) @ 2019-10-09  7:55 UTC (permalink / raw)
  To: Aleksa Sarai, Al Viro
  Cc: mtk.manpages, Christian Brauner, Aleksa Sarai, linux-man,
	linux-api, linux-kernel

Hello Aleksa,


On 10/3/19 4:55 PM, Aleksa Sarai wrote:
> Traditionally, magic-links have not been a well-understood topic in
> Linux. Given the new changes in their semantics (related to the link
> mode of trailing magic-links), it seems like a good opportunity to shine
> more light on magic-links and their semantics.
> 
> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

Thanks for doing this. Some comments below.

> ---
>  man7/path_resolution.7 | 15 +++++++++++++++
>  man7/symlink.7         | 39 ++++++++++++++++++++++++++++++---------
>  2 files changed, 45 insertions(+), 9 deletions(-)
> 
> diff --git a/man7/path_resolution.7 b/man7/path_resolution.7
> index 07664ed8faec..46f25ec4cdfa 100644
> --- a/man7/path_resolution.7
> +++ b/man7/path_resolution.7
> @@ -136,6 +136,21 @@ we are just creating it.
>  The details on the treatment
>  of the final entry are described in the manual pages of the specific
>  system calls.
> +.PP
> +Since Linux 5.FOO, if the final entry is a "magic-link" (see

"magic link". As Jann points out, this is more normal English usage.

> +.BR symlink (7)),
> +and the user is attempting to
> +.BR open (2)
> +it, then there is an additional permission-related restriction applied to the
> +operation: the requested access mode must not exceed the "link mode" of the
> +magic-link (unlike ordinary symlinks, magic-links have their own file mode.)

Remove the hyphens (magic link). Also, as Jann pointed out,
the manual pages fairly consistently use the term "symbolic link"
(written in full).

You use the term "file mode" here. Do you mean the file permission bits?
If yes, it is a bit misleading to suggest that symbolic links don't
have these mode bits. They do, but--as noted in the existing symlink(7)
manual page text--these bits are ignored. I suggest just removing the
parenthesized text.

> +For example, if
> +.I /proc/[pid]/fd/[num]
> +has a link mode of
> +.BR 0500 ,
> +unprivileged users are not permitted to
> +.BR open ()
> +the magic-link for writing.
>  .SS . and ..
>  By convention, every directory has the entries "." and "..",
>  which refer to the directory itself and to its parent directory,
> diff --git a/man7/symlink.7 b/man7/symlink.7
> index 9f5bddd5dc21..33f0ec703acd 100644
> --- a/man7/symlink.7
> +++ b/man7/symlink.7
> @@ -84,6 +84,25 @@ as they are implemented on Linux and other systems,
>  are outlined here.
>  It is important that site-local applications also conform to these rules,
>  so that the user interface can be as consistent as possible.
> +.SS Magic-links
> +There is a special class of symlink-like objects known as "magic-links" which

"magic links" (and through the rest of the page).

> +can be found in certain pseudo-filesystems such as

pseudofilesystems

> +.BR proc (5)
> +(examples include
> +.IR /proc/[pid]/exe " and " /proc/[pid]/fd/* .)
> +Unlike normal symlinks, magic-links are not resolved through

symbolic links

> +pathname-expansion, but instead act as direct references to the kernel's own

pathname expansion

> +representation of a file handle. As such, these magic-links allow users to
> +access files which cannot be referenced with normal paths (such as unlinked
> +files still referenced by a running program.)
> +.PP
> +Because they can bypass ordinary
> +.BR mount_namespaces (7)-based
> +restrictions, magic-links have been used as attack vectors in various exploits.
> +As such (since Linux 5.FOO), there are additional restrictions placed on the
> +re-opening of magic-links (see
> +.BR path_resolution (7)
> +for more details.)
>  .SS Symbolic link ownership, permissions, and timestamps
>  The owner and group of an existing symbolic link can be changed
>  using
> @@ -99,16 +118,18 @@ of a symbolic link can be changed using
>  or
>  .BR lutimes (3).
>  .PP
> -On Linux, the permissions of a symbolic link are not used
> -in any operations; the permissions are always
> -0777 (read, write, and execute for all user categories),
>  .\" Linux does not currently implement an lchmod(2).
> -and can't be changed.
> -(Note that there are some "magic" symbolic links in the
> -.I /proc
> -directory tree\(emfor example, the
> -.IR /proc/[pid]/fd/*
> -files\(emthat have different permissions.)
> +On Linux, the permissions of an ordinary symbolic link are not used in any
> +operations; the permissions are always 0777 (read, write, and execute for all
> +user categories), and can't be changed.
> +.PP
> +However, magic-links do not follow this rule. They can have a non-0777 mode,
> +which is used for permission checks when the final
> +component of an
> +.BR open (2)'s
> +path is a magic-link (see
> +.BR path_resolution (7).)
> +
>  .\"
>  .\" The
>  .\" 4.4BSD
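
For readers following along, the link-mode rule in the path_resolution(7)
hunk above can be illustrated with a small sketch. It assumes (the patch
does not spell this out) that only the owner read and write bits of the
link mode are compared against the requested access mode, and it ignores
privileged callers entirely. link_mode_permits() is a hypothetical name,
not a kernel or libc function.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <sys/stat.h>

/*
 * Hypothetical helper: return true if an open(2) with the given flags
 * would stay within the "link mode" of a magic link.
 * Assumption: only the owner bits (S_IRUSR/S_IWUSR) are consulted.
 */
bool link_mode_permits(mode_t link_mode, int flags)
{
    int acc = flags & O_ACCMODE;
    bool want_read = (acc == O_RDONLY || acc == O_RDWR);
    bool want_write = (acc == O_WRONLY || acc == O_RDWR);

    if (want_read && !(link_mode & S_IRUSR))
        return false;   /* read requested, link mode lacks read */
    if (want_write && !(link_mode & S_IWUSR))
        return false;   /* write requested, link mode lacks write */
    return true;
}
```

With the 0500 example from the patch, link_mode_permits(0500, O_RDONLY)
is true, while O_WRONLY and O_RDWR are both refused.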

Thanks,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


* Re: [PATCH RFC 2/3] open.2: add O_EMPTYPATH documentation
  2019-10-03 14:55 ` [PATCH RFC 2/3] open.2: add O_EMPTYPATH documentation Aleksa Sarai
@ 2019-10-09  8:01   ` Michael Kerrisk (man-pages)
  2019-10-09 10:00     ` Aleksa Sarai
  0 siblings, 1 reply; 18+ messages in thread
From: Michael Kerrisk (man-pages) @ 2019-10-09  8:01 UTC (permalink / raw)
  To: Aleksa Sarai, Al Viro
  Cc: mtk.manpages, Christian Brauner, Aleksa Sarai, linux-man,
	linux-api, linux-kernel

Hello Aleksa,

You write "5.FOO" in these patches. When do you expect these changes to 
land in the kernel?

On 10/3/19 4:55 PM, Aleksa Sarai wrote:
> Some of the wording around empty paths in path_resolution(7) also needed
> to be reworked since it's now legal (if you pass O_EMPTYPATH).
> 
> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
> ---
>  man2/open.2            | 42 +++++++++++++++++++++++++++++++++++++++++-
>  man7/path_resolution.7 | 17 ++++++++++++++++-
>  2 files changed, 57 insertions(+), 2 deletions(-)
> 
> diff --git a/man2/open.2 b/man2/open.2
> index b0f485b41589..7217fe056e5e 100644
> --- a/man2/open.2
> +++ b/man2/open.2
> @@ -48,7 +48,7 @@
>  .\" FIXME . Apr 08: The next POSIX revision has O_EXEC, O_SEARCH, and
>  .\" O_TTYINIT.  Eventually these may need to be documented.  --mtk
>  .\"
> -.TH OPEN 2 2018-04-30 "Linux" "Linux Programmer's Manual"
> +.TH OPEN 2 2019-10-03 "Linux" "Linux Programmer's Manual"

No need to update the timestamp. I have scripts that handle this
automatically.

>  .SH NAME
>  open, openat, creat \- open and possibly create a file
>  .SH SYNOPSIS
> @@ -421,6 +421,21 @@ was followed by a call to
>  .BR fdatasync (2)).
>  .IR "See NOTES below" .
>  .TP
> +.BR O_EMPTYPATH " (since Linux 5.FOO)"
> +If \fIpathname\fP is an empty string, re-open the the file descriptor given as

In general, I prefer the form

.I pathname

over \fIpathname\fP. 

If you would be willing to change that, it would save me a little work.
(And likewise throughout the rest of the patch.)

> +the \fIdirfd\fP argument to
> +.BR openat (2).
> +This can be used with both ordinary (file and directory) and \fBO_PATH\fP file
> +descriptors, but cannot be used with
> +.BR AT_FDCWD
> +(or as an argument to plain
> +.BR open (2).) When re-opening an \fBO_PATH\fP file descriptor, the same "link

There's a formatting problem here which can be fixed by inserting a 
newline before "When".

> +mode" restrictions apply as with re-opening through
> +.BR proc (5)
> +(see
> +.BR path_resolution "(7) and " symlink (7)
> +for more details.)
> +.TP
>  .B O_EXCL
>  Ensure that this call creates the file:
>  if this flag is specified in conjunction with
> @@ -668,6 +683,13 @@ with
>  (or via procfs using
>  .BR AT_SYMLINK_FOLLOW )
>  even if the file is not a directory.
> +You can even "re-open" (or upgrade) an
> +.BR O_PATH
> +file descriptor by using
> +.BR O_EMPTYPATH
> +(see the section for
> +.BR O_EMPTYPATH
> +for more details.)
>  .IP *
>  Passing the file descriptor to another process via a UNIX domain socket
>  (see
> @@ -958,6 +980,15 @@ is not allowed.
>  (See also
>  .BR path_resolution (7).)
>  .TP
> +.B EBADF
> +.I pathname
> +was an empty string (and
> +.B O_EMPTYPATH
> +was passed) with
> +.BR open (2)
> +(instead of
> +.BR openat (2).)
> +.TP
>  .B EDQUOT
>  Where
>  .B O_CREAT
> @@ -1203,6 +1234,15 @@ The following additional errors can occur for
>  .I dirfd
>  is not a valid file descriptor.
>  .TP
> +.B EBADF
> +.I pathname
> +was an empty string (and
> +.B O_EMPTYPATH
> +was passed), but the provided
> +.I dirfd
> +was an invalid file descriptor (or was
> +.BR AT_FDCWD .)
> +.TP
>  .B ENOTDIR
>  .I pathname
>  is a relative pathname and
> diff --git a/man7/path_resolution.7 b/man7/path_resolution.7
> index 46f25ec4cdfa..85dd354e9a93 100644
> --- a/man7/path_resolution.7
> +++ b/man7/path_resolution.7
> @@ -22,7 +22,7 @@
>  .\" the source, must acknowledge the copyright and authors of this work.
>  .\" %%%LICENSE_END
>  .\"
> -.TH PATH_RESOLUTION 7 2017-11-26 "Linux" "Linux Programmer's Manual"
> +.TH PATH_RESOLUTION 7 2019-10-03 "Linux" "Linux Programmer's Manual"
>  .SH NAME
>  path_resolution \- how a pathname is resolved to a file
>  .SH DESCRIPTION
> @@ -198,6 +198,21 @@ successfully.
>  Linux returns
>  .B ENOENT
>  in this case.
> +.PP
> +As of Linux 5.FOO, an empty path argument can be used to indicate the "re-open"
> +an existing file descriptor if
> +.B O_EMPTYPATH
> +is passed as a flag argument to
> +.BR openat (2),
> +with the
> +.I dfd
> +argument indicating which file descriptor to "re-open". This is approximately
> +equivalent to opening
> +.I /proc/self/fd/$fd

.IR /proc/self/fd/$fd ,

> +where
> +.I $fd
> +is the open file descriptor to be "re-opened".
> +

No blank line here.

>  .SS Permissions
>  The permission bits of a file consist of three groups of three bits; see
>  .BR chmod (1)
> 
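
The path_resolution(7) hunk above describes O_EMPTYPATH as approximately
equivalent to opening /proc/self/fd/$fd. A minimal sketch of that existing
procfs route follows, assuming /proc is mounted; reopen_via_procfs() is a
hypothetical helper name, and the proposed O_EMPTYPATH flag itself is not
used here.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: re-open an existing file descriptor with new
 * flags by going through its /proc/self/fd entry.
 * Returns the new descriptor, or -1 on error (errno is set by open(2)
 * when the failure comes from the open itself).
 */
int reopen_via_procfs(int fd, int flags)
{
    char path[64];
    int n = snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);

    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;              /* path did not fit in the buffer */
    return open(path, flags);
}
```

Unlike dup(2), this yields a new open file description, so the file
offset and status flags are not shared with the original descriptor.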

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


* Re: [PATCH RFC 3/3] openat2.2: document new openat2(2) syscall
  2019-10-03 14:55 ` [PATCH RFC 3/3] openat2.2: document new openat2(2) syscall Aleksa Sarai
@ 2019-10-09  8:36   ` Michael Kerrisk (man-pages)
  2019-10-09 10:17     ` Aleksa Sarai
  0 siblings, 1 reply; 18+ messages in thread
From: Michael Kerrisk (man-pages) @ 2019-10-09  8:36 UTC (permalink / raw)
  To: Aleksa Sarai, Al Viro
  Cc: mtk.manpages, Christian Brauner, Aleksa Sarai, linux-man,
	linux-api, linux-kernel

Hello Aleksa,

Thanks for this. It's a great piece of documentation work!

I would prefer the path_resolution(7) piece as a separate patch.


On 10/3/19 4:55 PM, Aleksa Sarai wrote:
> Rather than trying to merge the new syscall documentation into open.2
> (which would probably result in the man-page being incomprehensible),
> instead the new syscall gets its own dedicated page with links between
> open(2) and openat2(2) to avoid duplicating information such as the list
> of O_* flags or common errors.

Yes, looking at the size of the proposed openat2(2) page,
this seems best.
> 
> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
> ---
>  man2/open.2            |   5 +
>  man2/openat2.2         | 381 +++++++++++++++++++++++++++++++++++++++++
>  man7/path_resolution.7 |  57 ++++--
>  3 files changed, 426 insertions(+), 17 deletions(-)
>  create mode 100644 man2/openat2.2
> 
> diff --git a/man2/open.2 b/man2/open.2
> index 7217fe056e5e..a0b43394bbee 100644
> --- a/man2/open.2
> +++ b/man2/open.2
> @@ -65,6 +65,10 @@ open, openat, creat \- open and possibly create a file
>  .BI "int openat(int " dirfd ", const char *" pathname ", int " flags );
>  .BI "int openat(int " dirfd ", const char *" pathname ", int " flags \
>  ", mode_t " mode );
> +.PP
> +/* Docuented separately, in \fBopenat2\fP(2). */

Documented

> +.BI "int openat2(int " dirfd ", const char *" pathname ", \
> +const struct open_how *" how ", size_t " size ");
>  .fi
>  .PP
>  .in -4n
> @@ -1808,6 +1812,7 @@ will create a regular file (i.e.,
>  .B O_DIRECTORY
>  is ignored).
>  .SH SEE ALSO
> +.BR openat2 (2),

Entries here should be in alphabetical order (within
sections).

>  .BR chmod (2),
>  .BR chown (2),
>  .BR close (2),
> diff --git a/man2/openat2.2 b/man2/openat2.2
> new file mode 100644
> index 000000000000..c43c76046243
> --- /dev/null
> +++ b/man2/openat2.2
> @@ -0,0 +1,381 @@
> +.\" Copyright (C) 2019 Aleksa Sarai <cyphar@cyphar.com>
> +.\"
> +.\" %%%LICENSE_START(VERBATIM)
> +.\" Permission is granted to make and distribute verbatim copies of this
> +.\" manual provided the copyright notice and this permission notice are
> +.\" preserved on all copies.
> +.\"
> +.\" Permission is granted to copy and distribute modified versions of this
> +.\" manual under the conditions for verbatim copying, provided that the
> +.\" entire resulting derived work is distributed under the terms of a
> +.\" permission notice identical to this one.
> +.\"
> +.\" Since the Linux kernel and libraries are constantly changing, this
> +.\" manual page may be incorrect or out-of-date.  The author(s) assume no
> +.\" responsibility for errors or omissions, or for damages resulting from
> +.\" the use of the information contained herein.  The author(s) may not
> +.\" have taken the same level of care in the production of this manual,
> +.\" which is licensed free of charge, as they might when working
> +.\" professionally.
> +.\"
> +.\" Formatted or processed versions of this manual, if unaccompanied by
> +.\" the source, must acknowledge the copyright and authors of this work.
> +.\" %%%LICENSE_END
> +.TH OPENAT2 2 2019-10-03 "Linux" "Linux Programmer's Manual"
> +.SH NAME
> +openat2 \- open and possibly create a file (extended)
> +.SH SYNOPSIS
> +.nf
> +.B #include <sys/types.h>
> +.B #include <sys/stat.h>
> +.B #include <fcntl.h>
> +.PP
> +.BI "int openat2(int " dirfd ", const char *" pathname ", \
> +const struct open_how *" how ", size_t " size ");
> +.fi
> +.PP
> +.IR Note :
> +There is no glibc wrapper for this system call; see NOTES.
> +.SH DESCRIPTION
> +The
> +.BR openat2 ()
> +system call is an extension of
> +.BR openat (2)
> +and provides a superset of its functionality. Rather than taking a single

Please start new sentences on new source lines. I recently added this
text in man-pages(7):

   Use semantic newlines
       In the source of a manual page, new sentences should be started on
       new  lines,  and  long sentences should split into lines at clause
       breaks (commas, semicolons, colons, and so on).  This  convention,
       sometimes known as "semantic newlines", makes it easier to see the
       effect of patches, which often operate at the level of  individual
       sentences or sentence clauses.

> +.I flag
> +argument, an extensible structure (\fIhow\fP) is passed instead to allow for
> +seamless future extensions.

s/seamless//

> +.PP
> +.I size
> +must be set to
> +.IR "sizeof(struct open_how)" ,
> +to facilitate future extensions (see the "Extensibility" section of the
> +\fBNOTES\fP for more detail on how extensions are handled.)
> +
> +.SS The open_how structure
> +The following structure indicates how
> +.I pathname
> +should be opened, and acts as a superset of the
> +.IR flag " and " mode
> +arguments to
> +.BR openat (2).
> +.PP
> +.in +4n
> +.EX
> +struct open_how {
> +    uint32_t flags;              /* open(2)-style O_* flags. */
> +    union {
> +        uint16_t mode;           /* File mode bits for new file creation. */
> +        uint16_t upgrade_mask;   /* Restrict how O_PATHs may be re-opened. */
> +    };
> +    uint32_t resolve;            /* RESOLVE_* path-resolution flags. */
> +};
> +.EE
> +.in
> +.PP
> +Any future extensions to
> +.BR openat2 ()
> +will be implemented as new fields appended to the above structure, with the
> +zero value of the new fields acting as though the extension were not present.
> +.PP
> +The meaning of each field is as follows:
> +.RS
> +
> +.I flags
> +.RSall
> +The file creation and status flags to use for this operation. All of the
> +.B O_*
> +flags defined for
> +.BR openat (2)
> +are valid
> +.BR openat2 ()
> +flag values.
> +.RE
> +
> +.I upgrade_mask
> +.RS
> +Restrict with which
> +.I access modes
> +the returned
> +.B O_PATH
> +descriptor may be re-opened (either through
> +.B O_EMPTYPATH
> +or
> +.IR /proc/self/fd/ .)
> +This field may only be set to a non-zero value if
> +.I flags
> +contains
> +.BR O_PATH .
> +By default, an
> +.B O_PATH
> +file descriptor of an ordinary file may be re-opened with with any access mode (but an
> +.B O_PATH
> +file descriptor of a magic-link may only be re-opened with access modes that
> +the original magic-link possessed). The full list of

magic link (throughout the page)

> +.I upgrade_mask
> +flags is given below.
> +.TP
> +.B UPGRADE_NOREAD
> +Do not permit the
> +.B O_PATH
> +file descriptor to be re-opened for reading (i.e.
> +.BR O_RDONLY " or " O_RDWR .)
> +.TP
> +.B UPGRADE_NOWRITE
> +Do not permit the
> +.B O_PATH
> +file descriptor to be re-opened for writing (i.e.
> +.BR O_WRONLY ", " O_RDWR ", or " O_APPEND .)
> +.RE
> +.I resolve
> +.RS
> +Change how the components of
> +.I pathname
> +will be resolved (see
> +.BR path_resolution (7)
> +for background information.) The primary use-case for these flags is to allow

use case

> +trusted programs to restrict how un-trusted paths (or paths inside un-trusted

untrusted

> +directories) are resolved. The full list of
> +.I resolve
> +flags is given below.
> +.TP
> +.B RESOLVE_NO_XDEV
> +Disallow all mount-point crossings during path resolution (including

I think this would be better: "Disallow traversal of mount points". Do you
agree?

> +all bind-mounts).

bind mounts

> +
> +Users of this flag are encouraged to make its use configurable (unless it is
> +used for a specific security purpose), as bind-mounts are very widely used by
> +end-users and thus enabling this flag globally may result in spurious errors on
> +some systems.
> +.TP
> +.B RESOLVE_NO_SYMLINKS
> +Disallow all symlink resolution during path resolution. If the trailing

Disallow resolution of symbolic links during path resolution

> +component is a symlink, and

symbolic link (throughout the page)

> +.I flags
> +contains both
> +.BR O_PATH " and " O_NOFOLLOW ","
> +then an
> +.B O_PATH
> +file descriptor referencing the symlink will be returned. This option implies
> +.BR RESOLVE_NO_MAGICLINKS .
> +
> +Users of this flag are encouraged to make its use configurable (unless it is
> +used for a specific security purpose), as symlinks are very widely used by
> +end-users and thus enabling this flag globally may result in spurious errors on
> +some systems.

It's not really clear what you mean by "enabling this flag globally".
Could you reword, or explain in a bit more detail?

> +.TP
> +.B RESOLVE_NO_MAGICLINKS
> +Disallow all magic-link resolution during path resolution. If the trailing
> +component is a magic-link, and
> +.I flags
> +contains both
> +.BR O_PATH " and " O_NOFOLLOW ","
> +then an
> +.B O_PATH
> +file descriptor referencing the magic-link will be returned.
> +
> +Magic-links are symlink-like objects that are most notably found in
> +.BR proc (5)
> +(examples include
> +.IR /proc/[pid]/exe " and " /proc/[pid]/fd/* .)
> +Due to the potential danger of unknowingly opening these magic-links, it may be
> +preferable for users to disable their resolution entirely (see
> +.BR symlink (7)
> +for more details.)
> +.TP
> +.B RESOLVE_BENEATH
> +Do not permit the path resolution to succeed if any component of the resolution
> +is not a descendant of the directory indicated by
> +.IR dirfd .
> +This results in absolute symlinks (and absolute values of
> +.IR pathname )
> +to be rejected. Magic-link resolution is also not permitted.

So, this flag implies RESOLVE_NO_MAGICLINKS? If yes,
it would be good to state that more explicitly.

> +
> +.TP
> +.B RESOLVE_IN_ROOT
> +Temporarily treat
> +.I dirfd
> +as the root of the filesystem (as though the user called

Perhaps better:

Treat
.I dirfd
as the root directory while resolving
.I pathname
(as though...)

> +.BR chroot (2)
> +with
> +.IR dirfd
> +as the argument.) Absolute symlinks and ".." path components will be scoped to
> +.IR dirfd . Magic-link resolution is also not permitted.

Insert a newline before "Magic" to fix a formatting problem.

So, this flag implies RESOLVE_NO_MAGICLINKS? If yes,
it would be good to state that more explicitly.

> +
> +However, unlike
> +.BR chroot (2)
> +(which changes the filesystem root persistently for an entire thread-group),

s/persistently for an entire thread-group/
 /permanently for a process/

> +.B RESOLVE_IN_ROOT
> +allows a program to efficiently restrict path resolution for only certain
> +operations. It also has several hardening features (such as not permitting
> +magic-link resolution) which
> +.BR chroot (2)
> +does not.
> +.RE
> +
> +.RE
> +
> +.PP
> +Unlike
> +.BR openat (2),
> +any unknown flags set in fields of
> +.I how
> +will result in an error, rather than being ignored. 

Thank you, thank you, thank you. It was sad
that openat() never fixed that antifeature.

> In addition, an error will
> +be returned if the value of the
> +.IR mode " and " upgrade_mask
> +union is non-zero unless:
> +.RS
> +.IP * 3
> +.I flags
> +indicates that a new file will be created (it contains
> +.BR O_CREAT " or " O_TMPFILE ),
> +in which case
> +.I mode
> +may be any valid file mode.
> +.IP *
> +.I flags
> +contains
> +.BR O_PATH ,
> +in which case
> +.I upgrade_mask
> +must only contain valid
> +.B UPGRADE_*
> +flags.
> +.RE
> +
> +.SH RETURN VALUE
> +On success, a new file descriptor is returned. On error, -1 is returned, and
> +.I errno
> +is set appropriately.
> +
> +.SH ERRORS
> +The set of errors returned by
> +.BR openat2 ()
> +includes all of the errors returned by
> +.BR openat (2),
> +as well as the following additional errors:
> +.TP
> +.B EINVAL
> +An unknown flag or invalid value was specified in
> +.IR how .
> +.TP
> +.B EINVAL
> +.I size
> +was smaller than any known version of
> +.IR "struct open_how" .
> +.TP
> +.B E2BIG
> +An extension was specified in
> +.IR how ,
> +which the current kernel does not support (see the "Extensibility" section of
> +the \fBNOTES\fP for more detail on how extensions are handled.)
> +.TP
> +.B EAGAIN
> +.I resolve
> +contains either
> +.BR RESOLVE_IN_ROOT " or " RESOLVE_BENEATH ,
> +and the kernel could not ensure that a ".." component didn't escape (due to a
> +race condition or potential attack). Callers may choose to retry the
> +.BR openat2 ()
> +call.
> +.TP
> +.B EXDEV
> +.I resolve
> +contains either
> +.BR RESOLVE_IN_ROOT " or " RESOLVE_BENEATH ,
> +and a path component attempted to escape the root of the resolution.
> +
> +.TP
> +.B EXDEV
> +.I resolve
> +contains
> +.BR RESOLVE_NO_XDEV ,
> +and a path component attempted to cross a mount-point.

mount point

> +
> +.TP
> +.B ELOOP
> +.I resolve
> +contains
> +.BR RESOLVE_NO_SYMLINKS ,
> +and one of the path components was a symlink.
> +.TP
> +.B ELOOP
> +.I resolve
> +contains
> +.BR RESOLVE_NO_MAGICLINKS ,
> +and one of the path components was a magic-link.
> +
> +.SH VERSIONS
> +.BR openat2 ()
> +was added to Linux in kernel 5.FOO.
> +
> +.SH CONFORMING TO
> +This system call is Linux-specific.
> +
> +The semantics of
> +.B RESOLVE_BENEATH
> +were modelled after FreeBSD's
> +.BR O_BENEATH .
> +
> +.SH NOTES
> +Glibc does not provide a wrapper for this system call; call it using
> +.BR syscall (2).
> +
> +.SS Extensibility
> +In order to allow for
> +.I struct open_how
> +to be extended in future kernel revisions,
> +.BR openat2 ()
> +requires userspace to specify what sized

s/what sized/the size of/

> +.I struct open_how
> +structure they are passing. By providing this information, it is possible for
> +.BR openat2 ()
> +to provide both forwards- and backwards-compatibility \(em with
> +.I size
> +acting as an implicit version number (because new extension fields will always
> +be appended, the size will always increase.) This extensibility design is very
> +similar to other system calls such as
> +.BR perf_setattr "(2), " perf_event_open "(2), and " clone (3).

The following explanation of usize and ksize is great. Thanks for that.

> +If we let
> +.I usize
> +be the size of the structure according to userspace and
> +.I ksize
> +be the size of the structure which the kernel supports, then there are only
> +three cases to consider:
> +
> +.RS
> +.IP * 3
> +If
> +.IR ksize " equals " usize ,
> +then there is no version mismatch and
> +.I how
> +can be used verbatim.
> +.IP *
> +If
> +.IR ksize " is larger than " usize ,
> +then there are some extensions the kernel supports which the userspace program
> +is unaware of. Because all extensions must have their zero values be a no-op,
> +the kernel treats all of the extension fields not set by userspace to have zero
> +values. This provides backwards-compatibility.
> +.IP *
> +If
> +.IR ksize " is smaller than " usize ,
> +then there are some extensions which the userspace program is aware of but the
> +kernel does not support. Because all extensions must have their zero values be
> +a no-op, the kernel can safely ignore the unsupported extension fields if they
> +are all-zero. If any unsupported extension fields are non-zero, then an error
> +is returned. This provides forwards-compatibility.
> +.RE
> +
> +Therefore, most userspace programs will not need to have any special handling
> +of extensions. However, if a userspace program wishes to determine what
> +extensions the running kernel supports, they may conduct a binary search on
> +.IR size
> +(to find the largest value which doesn't produce an error.)
> +
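
The three usize/ksize cases above can be sketched in plain C. This is a
userspace model under stated assumptions, not the kernel's implementation:
copy_extensible_struct() is a hypothetical name, both buffers are ordinary
memory, and -E2BIG is returned for non-zero unknown fields as the ERRORS
section describes.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy a caller-supplied struct of usize bytes into
 * a ksize-byte struct understood by the callee.
 * Returns 0 on success, or -E2BIG if the caller set a field that the
 * callee does not know about.
 */
int copy_extensible_struct(void *kbuf, size_t ksize,
                           const void *ubuf, size_t usize)
{
    const unsigned char *src = ubuf;
    size_t common = usize < ksize ? usize : ksize;

    assert(kbuf != NULL && ubuf != NULL);

    /* ksize < usize: unknown trailing fields must all be zero. */
    for (size_t i = ksize; i < usize; i++)
        if (src[i] != 0)
            return -E2BIG;

    /* ksize > usize: fields the caller omitted are treated as zero. */
    memset(kbuf, 0, ksize);

    /* ksize == usize falls out naturally: the struct is copied verbatim. */
    memcpy(kbuf, ubuf, common);
    return 0;
}
```

Because a zero value always means "extension not used", neither side needs
an explicit version number; the size alone is enough.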
> +.SH SEE ALSO
> +.BR openat (2),
> +.BR path_resolution (7),
> +.BR symlink (7)
> diff --git a/man7/path_resolution.7 b/man7/path_resolution.7
> index 85dd354e9a93..3da3e5b614c8 100644
> --- a/man7/path_resolution.7
> +++ b/man7/path_resolution.7
> @@ -29,17 +29,17 @@ path_resolution \- how a pathname is resolved to a file
>  Some UNIX/Linux system calls have as parameter one or more filenames.
>  A filename (or pathname) is resolved as follows.
>  .SS Step 1: start of the resolution process
> -If the pathname starts with the \(aq/\(aq character,
> -the starting lookup directory
> -is the root directory of the calling process.
> -(A process inherits its
> -root directory from its parent.
> -Usually this will be the root directory
> -of the file hierarchy.
> -A process may get a different root directory
> -by use of the
> +If the pathname starts with the \(aq/\(aq character, the starting lookup
> +directory is the root directory of the calling process. (A process inherits its
> +root directory from its parent. Usually this will be the root directory of the
> +file hierarchy. A process may get a different root directory by use of the
>  .BR chroot (2)
> -system call.
> +system call, or may temporarily use a different root directory by using
> +.BR openat2 (2)
> +with the
> +.B RESOLVE_IN_ROOT
> +flag set.
> +.PP
>  A process may get an entirely private mount namespace in case
>  it\(emor one of its ancestors\(emwas started by an invocation of the
>  .BR clone (2)
> @@ -48,16 +48,24 @@ system call that had the
>  flag set.)
>  This handles the \(aq/\(aq part of the pathname.
>  .PP
> -If the pathname does not start with the \(aq/\(aq character, the
> -starting lookup directory of the resolution process is the current working
> -directory of the process.
> -(This is also inherited from the parent.
> -It can be changed by use of the
> +If the pathname does not start with the \(aq/\(aq character, the starting
> +lookup directory of the resolution process is the current working directory of
> +the process \(em or in the case of
> +.BR openat (2)-style
> +syscalls, the

system calls

> +.I dfd
> +argument (or the current working directory if
> +.B AT_FDCWD
> +is passed as the
> +.I dfd
> +argumnet). The current working directory is inherited from the parent, and can

argument

> +be changed by use of the
>  .BR chdir (2)
> -system call.)
> +syscall.

"system call" please.

>  .PP
>  Pathnames starting with a \(aq/\(aq character are called absolute pathnames.
>  Pathnames not starting with a \(aq/\(aq are called relative pathnames.
> +

No blank line here.

>  .SS Step 2: walk along the path
>  Set the current lookup directory to the starting lookup directory.
>  Now, for each nonfinal component of the pathname, where a component
> @@ -124,6 +132,13 @@ the kernel's pathname-resolution code
>  was reworked to eliminate the use of recursion,
>  so that the only limit that remains is the maximum of 40
>  resolutions for the entire pathname.
> +.PP
> +The resolution of syscalls during this stage can be blocked by using

"resolution of syscall" seems wrong? "syscall" should be something 
else?

> +.BR openat2 (2),
> +with the
> +.B RESOLVE_NO_SYMLINKS
> +flag set.
> +
>  .SS Step 3: find the final entry
>  The lookup of the final component of the pathname goes just like
>  that of all other components, as described in the previous step,
> @@ -160,7 +175,8 @@ The path resolution process will assume that these entries have
>  their conventional meanings, regardless of whether they are
>  actually present in the physical filesystem.
>  .PP
> -One cannot walk down past the root: "/.." is the same as "/".
> +One cannot walk up past the root: "/.." is the same as "/".
> +

No blank line please.

>  .SS Mount points
>  After a "mount dev path" command, the pathname "path" refers to
>  the root of the filesystem hierarchy on the device "dev", and no
> @@ -169,6 +185,13 @@ longer to whatever it referred to earlier.
>  One can walk out of a mounted filesystem: "path/.." refers to
>  the parent directory of "path",
>  outside of the filesystem hierarchy on "dev".
> +.PP
> +Mount-point crossings can be blocked by using

Traversal of mount points can be disallowed by...

> +.BR openat2 (2),
> +with the
> +.B RESOLVE_NO_XDEV
> +flag set (though note that this also restricts bind-mount crossings).
> +

No blank line please.

>  .SS Trailing slashes
>  If a pathname ends in a \(aq/\(aq, that forces resolution of the preceding
>  component as in Step 2: it has to exist and resolve to a directory.
> 

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


* Re: [PATCH RFC 1/3] symlink.7: document magic-links more completely
  2019-10-09  7:55   ` Michael Kerrisk (man-pages)
@ 2019-10-09  9:57     ` Aleksa Sarai
  0 siblings, 0 replies; 18+ messages in thread
From: Aleksa Sarai @ 2019-10-09  9:57 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Al Viro, Christian Brauner, Aleksa Sarai, linux-man, linux-api,
	linux-kernel

On 2019-10-09, Michael Kerrisk (man-pages) <mtk.manpages@gmail.com> wrote:
> On 10/3/19 4:55 PM, Aleksa Sarai wrote:
> > Traditionally, magic-links have not been a well-understood topic in
> > Linux. Given the new changes in their semantics (related to the link
> > mode of trailing magic-links), it seems like a good opportunity to shine
> > more light on magic-links and their semantics.
> > 
> > Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
> 
> Thanks for doing this. Some comments below.

No problem -- just a heads-up that I'm going to split off the magic-link
changes from the openat2(2) series (there are quite a few things that
need to be done). So I will drop this man page for now.

> > ---
> >  man7/path_resolution.7 | 15 +++++++++++++++
> >  man7/symlink.7         | 39 ++++++++++++++++++++++++++++++---------
> >  2 files changed, 45 insertions(+), 9 deletions(-)
> > 
> > diff --git a/man7/path_resolution.7 b/man7/path_resolution.7
> > index 07664ed8faec..46f25ec4cdfa 100644
> > --- a/man7/path_resolution.7
> > +++ b/man7/path_resolution.7
> > @@ -136,6 +136,21 @@ we are just creating it.
> >  The details on the treatment
> >  of the final entry are described in the manual pages of the specific
> >  system calls.
> > +.PP
> > +Since Linux 5.FOO, if the final entry is a "magic-link" (see
> 
> "magic link". As Jann points out, this is more normal English usage.
> 
> > +.BR symlink (7)),
> > +and the user is attempting to
> > +.BR open (2)
> > +it, then there is an additional permission-related restriction applied to the
> > +operation: the requested access mode must not exceed the "link mode" of the
> > +magic-link (unlike ordinary symlinks, magic-links have their own file mode.)
> 
> Remove the hyphens (magic link). And also, as someone else pointed out,
> manual pages fairly consistently use the term "symbolic link"
> (written in full).

Will do.

> You use the term "file mode" here. Do you mean the file permissions bits?

Yes.

> If yes, it is a bit misleading to suggest that symbolic links don't
> have these mode bits. They do, but--as noted in the existing symlink(7)
> manual page text--these bits are ignored. I suggest just removing the
> parenthesized text.

I was trying to say that their file mode can be non-0777 -- but I can
just drop the entire thing.

> > +For example, if
> > +.I /proc/[pid]/fd/[num]
> > +has a link mode of
> > +.BR 0500 ,
> > +unprivileged users are not permitted to
> > +.BR open ()
> > +the magic-link for writing.
> >  .SS . and ..
> >  By convention, every directory has the entries "." and "..",
> >  which refer to the directory itself and to its parent directory,
> > diff --git a/man7/symlink.7 b/man7/symlink.7
> > index 9f5bddd5dc21..33f0ec703acd 100644
> > --- a/man7/symlink.7
> > +++ b/man7/symlink.7
> > @@ -84,6 +84,25 @@ as they are implemented on Linux and other systems,
> >  are outlined here.
> >  It is important that site-local applications also conform to these rules,
> >  so that the user interface can be as consistent as possible.
> > +.SS Magic-links
> > +There is a special class of symlink-like objects known as "magic-links" which
> 
> "magic links" (and through the rest of the page).
> 
> > +can be found in certain pseudo-filesystems such as
> 
> pseudofilesystems
> 
> > +.BR proc (5)
> > +(examples include
> > +.IR /proc/[pid]/exe " and " /proc/[pid]/fd/* .)
> > +Unlike normal symlinks, magic-links are not resolved through
> 
> symbolic links
> 
> > +pathname-expansion, but instead act as direct references to the kernel's own
> 
> pathname expansion

Will do all of the above.

> > +representation of a file handle. As such, these magic-links allow users to
> > +access files which cannot be referenced with normal paths (such as unlinked
> > +files still referenced by a running program.)
> > +.PP
> > +Because they can bypass ordinary
> > +.BR mount_namespaces (7)-based
> > +restrictions, magic-links have been used as attack vectors in various exploits.
> > +As such (since Linux 5.FOO), there are additional restrictions placed on the
> > +re-opening of magic-links (see
> > +.BR path_resolution (7)
> > +for more details.)
> >  .SS Symbolic link ownership, permissions, and timestamps
> >  The owner and group of an existing symbolic link can be changed
> >  using
> > @@ -99,16 +118,18 @@ of a symbolic link can be changed using
> >  or
> >  .BR lutimes (3).
> >  .PP
> > -On Linux, the permissions of a symbolic link are not used
> > -in any operations; the permissions are always
> > -0777 (read, write, and execute for all user categories),
> >  .\" Linux does not currently implement an lchmod(2).
> > -and can't be changed.
> > -(Note that there are some "magic" symbolic links in the
> > -.I /proc
> > -directory tree\(emfor example, the
> > -.IR /proc/[pid]/fd/*
> > -files\(emthat have different permissions.)
> > +On Linux, the permissions of an ordinary symbolic link are not used in any
> > +operations; the permissions are always 0777 (read, write, and execute for all
> > +user categories), and can't be changed.
> > +.PP
> > +However, magic-links do not follow this rule. They can have a non-0777 mode,
> > +which is used for permission checks when the final
> > +component of an
> > +.BR open (2)'s
> > +path is a magic-link (see
> > +.BR path_resolution (7).)
> > +
> >  .\"
> >  .\" The
> >  .\" 4.4BSD

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>


* Re: [PATCH RFC 2/3] open.2: add O_EMPTYPATH documentation
  2019-10-09  8:01   ` Michael Kerrisk (man-pages)
@ 2019-10-09 10:00     ` Aleksa Sarai
  0 siblings, 0 replies; 18+ messages in thread
From: Aleksa Sarai @ 2019-10-09 10:00 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Al Viro, Christian Brauner, Aleksa Sarai, linux-man, linux-api,
	linux-kernel

[-- Attachment #1: Type: text/plain, Size: 5326 bytes --]

On 2019-10-09, Michael Kerrisk (man-pages) <mtk.manpages@gmail.com> wrote:
> Hello Aleksa,
> 
> You write "5.FOO" in these patches. When do you expect these changes to 
> land in the kernel?

Probably 5.6 (I'd hope for 5.5, but I don't know how the v14 review will
go). I'm not too sure though, and the magic-link changes (plus
O_EMPTYPATH) will probably land after openat2(2) since there is some
remaining work to do.

> On 10/3/19 4:55 PM, Aleksa Sarai wrote:
> > Some of the wording around empty paths in path_resolution(7) also needed
> > to be reworked since it's now legal (if you pass O_EMPTYPATH).
> > 
> > Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
> > ---
> >  man2/open.2            | 42 +++++++++++++++++++++++++++++++++++++++++-
> >  man7/path_resolution.7 | 17 ++++++++++++++++-
> >  2 files changed, 57 insertions(+), 2 deletions(-)
> > 
> > diff --git a/man2/open.2 b/man2/open.2
> > index b0f485b41589..7217fe056e5e 100644
> > --- a/man2/open.2
> > +++ b/man2/open.2
> > @@ -48,7 +48,7 @@
> >  .\" FIXME . Apr 08: The next POSIX revision has O_EXEC, O_SEARCH, and
> >  .\" O_TTYINIT.  Eventually these may need to be documented.  --mtk
> >  .\"
> > -.TH OPEN 2 2018-04-30 "Linux" "Linux Programmer's Manual"
> > +.TH OPEN 2 2019-10-03 "Linux" "Linux Programmer's Manual"
> 
> No need to update the timestamp. I have scripts that handle this
> automatically.
> 
> >  .SH NAME
> >  open, openat, creat \- open and possibly create a file
> >  .SH SYNOPSIS
> > @@ -421,6 +421,21 @@ was followed by a call to
> >  .BR fdatasync (2)).
> >  .IR "See NOTES below" .
> >  .TP
> > +.BR O_EMPTYPATH " (since Linux 5.FOO)"
> > +If \fIpathname\fP is an empty string, re-open the the file descriptor given as
> 
> In general, I prefer the general form
> 
> .I pathname
> 
> over \fIpathname\fP. 
> 
> If you would be willing to change that, it would save me a little work.
> (And likewise throughout the rest of the patch.)
> 
> > +the \fIdirfd\fP argument to
> > +.BR openat (2).
> > +This can be used with both ordinary (file and directory) and \fBO_PATH\fP file
> > +descriptors, but cannot be used with
> > +.BR AT_FDCWD
> > +(or as an argument to plain
> > +.BR open (2).) When re-opening an \fBO_PATH\fP file descriptor, the same "link
> 
> There's a formatting problem here which can be fixed by inserting a 
> newline before "When".
> 
> > +mode" restrictions apply as with re-opening through
> > +.BR proc (5)
> > +(see
> > +.BR path_resolution "(7) and " symlink (7)
> > +for more details.)
> > +.TP
> >  .B O_EXCL
> >  Ensure that this call creates the file:
> >  if this flag is specified in conjunction with
> > @@ -668,6 +683,13 @@ with
> >  (or via procfs using
> >  .BR AT_SYMLINK_FOLLOW )
> >  even if the file is not a directory.
> > +You can even "re-open" (or upgrade) an
> > +.BR O_PATH
> > +file descriptor by using
> > +.BR O_EMPTYPATH
> > +(see the section for
> > +.BR O_EMPTYPATH
> > +for more details.)
> >  .IP *
> >  Passing the file descriptor to another process via a UNIX domain socket
> >  (see
> > @@ -958,6 +980,15 @@ is not allowed.
> >  (See also
> >  .BR path_resolution (7).)
> >  .TP
> > +.B EBADF
> > +.I pathname
> > +was an empty string (and
> > +.B O_EMPTYPATH
> > +was passed) with
> > +.BR open (2)
> > +(instead of
> > +.BR openat (2).)
> > +.TP
> >  .B EDQUOT
> >  Where
> >  .B O_CREAT
> > @@ -1203,6 +1234,15 @@ The following additional errors can occur for
> >  .I dirfd
> >  is not a valid file descriptor.
> >  .TP
> > +.B EBADF
> > +.I pathname
> > +was an empty string (and
> > +.B O_EMPTYPATH
> > +was passed), but the provided
> > +.I dirfd
> > +was an invalid file descriptor (or was
> > +.BR AT_FDCWD .)
> > +.TP
> >  .B ENOTDIR
> >  .I pathname
> >  is a relative pathname and
> > diff --git a/man7/path_resolution.7 b/man7/path_resolution.7
> > index 46f25ec4cdfa..85dd354e9a93 100644
> > --- a/man7/path_resolution.7
> > +++ b/man7/path_resolution.7
> > @@ -22,7 +22,7 @@
> >  .\" the source, must acknowledge the copyright and authors of this work.
> >  .\" %%%LICENSE_END
> >  .\"
> > -.TH PATH_RESOLUTION 7 2017-11-26 "Linux" "Linux Programmer's Manual"
> > +.TH PATH_RESOLUTION 7 2019-10-03 "Linux" "Linux Programmer's Manual"
> >  .SH NAME
> >  path_resolution \- how a pathname is resolved to a file
> >  .SH DESCRIPTION
> > @@ -198,6 +198,21 @@ successfully.
> >  Linux returns
> >  .B ENOENT
> >  in this case.
> > +.PP
> > +As of Linux 5.FOO, an empty path argument can be used to indicate the "re-open"
> > +an existing file descriptor if
> > +.B O_EMPTYPATH
> > +is passed as a flag argument to
> > +.BR openat (2),
> > +with the
> > +.I dfd
> > +argument indicating which file descriptor to "re-open". This is approximately
> > +equivalent to opening
> > +.I /proc/self/fd/$fd
> 
> .IR /proc/self/fd/$fd ,
> 
> > +where
> > +.I $fd
> > +is the open file descriptor to be "re-opened".
> > +
> 
> No blank line here.
> 
> >  .SS Permissions
> >  The permission bits of a file consist of three groups of three bits; see
> >  .BR chmod (1)

Will fix all of the above -- thanks!


-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>


* Re: [PATCH RFC 3/3] openat2.2: document new openat2(2) syscall
  2019-10-09  8:36   ` Michael Kerrisk (man-pages)
@ 2019-10-09 10:17     ` Aleksa Sarai
  2019-10-09 10:32       ` Michael Kerrisk (man-pages)
  0 siblings, 1 reply; 18+ messages in thread
From: Aleksa Sarai @ 2019-10-09 10:17 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Al Viro, Christian Brauner, Aleksa Sarai, linux-man, linux-api,
	linux-kernel

[-- Attachment #1: Type: text/plain, Size: 22645 bytes --]

On 2019-10-09, Michael Kerrisk (man-pages) <mtk.manpages@gmail.com> wrote:
> Hello Aleksa,
> 
> Thanks for this. It's a great piece of documentation work!
> 
> I would prefer the path_resolution(7) piece as a separate patch.

Thanks, and will do.

> On 10/3/19 4:55 PM, Aleksa Sarai wrote:
> > Rather than trying to merge the new syscall documentation into open.2
> > (which would probably result in the man-page being incomprehensible),
> > instead the new syscall gets its own dedicated page with links between
> > open(2) and openat2(2) to avoid duplicating information such as the list
> > of O_* flags or common errors.
> 
> Yes, looking at the size of the proposed openat2(2) page,
> this seems best.
> > 
> > Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
> > ---
> >  man2/open.2            |   5 +
> >  man2/openat2.2         | 381 +++++++++++++++++++++++++++++++++++++++++
> >  man7/path_resolution.7 |  57 ++++--
> >  3 files changed, 426 insertions(+), 17 deletions(-)
> >  create mode 100644 man2/openat2.2
> > 
> > diff --git a/man2/open.2 b/man2/open.2
> > index 7217fe056e5e..a0b43394bbee 100644
> > --- a/man2/open.2
> > +++ b/man2/open.2
> > @@ -65,6 +65,10 @@ open, openat, creat \- open and possibly create a file
> >  .BI "int openat(int " dirfd ", const char *" pathname ", int " flags );
> >  .BI "int openat(int " dirfd ", const char *" pathname ", int " flags \
> >  ", mode_t " mode );
> > +.PP
> > +/* Docuented separately, in \fBopenat2\fP(2). */
> 
> Documented
> 
> > +.BI "int openat2(int " dirfd ", const char *" pathname ", \
> > +const struct open_how *" how ", size_t " size ");
> >  .fi
> >  .PP
> >  .in -4n
> > @@ -1808,6 +1812,7 @@ will create a regular file (i.e.,
> >  .B O_DIRECTORY
> >  is ignored).
> >  .SH SEE ALSO
> > +.BR openat2 (2),
> 
> Entries here should go in alphabetical order (within
> sections).
> 
> >  .BR chmod (2),
> >  .BR chown (2),
> >  .BR close (2),
> > diff --git a/man2/openat2.2 b/man2/openat2.2
> > new file mode 100644
> > index 000000000000..c43c76046243
> > --- /dev/null
> > +++ b/man2/openat2.2
> > @@ -0,0 +1,381 @@
> > +.\" Copyright (C) 2019 Aleksa Sarai <cyphar@cyphar.com>
> > +.\"
> > +.\" %%%LICENSE_START(VERBATIM)
> > +.\" Permission is granted to make and distribute verbatim copies of this
> > +.\" manual provided the copyright notice and this permission notice are
> > +.\" preserved on all copies.
> > +.\"
> > +.\" Permission is granted to copy and distribute modified versions of this
> > +.\" manual under the conditions for verbatim copying, provided that the
> > +.\" entire resulting derived work is distributed under the terms of a
> > +.\" permission notice identical to this one.
> > +.\"
> > +.\" Since the Linux kernel and libraries are constantly changing, this
> > +.\" manual page may be incorrect or out-of-date.  The author(s) assume no
> > +.\" responsibility for errors or omissions, or for damages resulting from
> > +.\" the use of the information contained herein.  The author(s) may not
> > +.\" have taken the same level of care in the production of this manual,
> > +.\" which is licensed free of charge, as they might when working
> > +.\" professionally.
> > +.\"
> > +.\" Formatted or processed versions of this manual, if unaccompanied by
> > +.\" the source, must acknowledge the copyright and authors of this work.
> > +.\" %%%LICENSE_END
> > +.TH OPENAT2 2 2019-10-03 "Linux" "Linux Programmer's Manual"
> > +.SH NAME
> > +openat2 \- open and possibly create a file (extended)
> > +.SH SYNOPSIS
> > +.nf
> > +.B #include <sys/types.h>
> > +.B #include <sys/stat.h>
> > +.B #include <fcntl.h>
> > +.PP
> > +.BI "int openat2(int " dirfd ", const char *" pathname ", \
> > +const struct open_how *" how ", size_t " size ");
> > +.fi
> > +.PP
> > +.IR Note :
> > +There is no glibc wrapper for this system call; see NOTES.
> > +.SH DESCRIPTION
> > +The
> > +.BR openat2 ()
> > +system call is an extension of
> > +.BR openat (2)
> > +and provides a superset of its functionality. Rather than taking a single
> 
> Please start new sentences on new source lines. I recently added this
> text in man-pages(7):
> 
>    Use semantic newlines
>        In the source of a manual page, new sentences should be started on
>        new  lines,  and  long sentences should split into lines at clause
>        breaks (commas, semicolons, colons, and so on).  This  convention,
>        sometimes known as "semantic newlines", makes it easier to see the
>        effect of patches, which often operate at the level of  individual
>        sentences or sentence clauses.
> 
> > +.I flag
> > +argument, an extensible structure (\fIhow\fP) is passed instead to allow for
> > +seamless future extensions.
> 
> s/seamless//
> 
> > +.PP
> > +.I size
> > +must be set to
> > +.IR "sizeof(struct open_how)" ,
> > +to facilitate future extensions (see the "Extensibility" section of the
> > +\fBNOTES\fP for more detail on how extensions are handled.)
> > +
> > +.SS The open_how structure
> > +The following structure indicates how
> > +.I pathname
> > +should be opened, and acts as a superset of the
> > +.IR flag " and " mode
> > +arguments to
> > +.BR openat (2).
> > +.PP
> > +.in +4n
> > +.EX
> > +struct open_how {
> > +    uint32_t flags;              /* open(2)-style O_* flags. */
> > +    union {
> > +        uint16_t mode;           /* File mode bits for new file creation. */
> > +        uint16_t upgrade_mask;   /* Restrict how O_PATHs may be re-opened. */
> > +    };
> > +    uint32_t resolve;            /* RESOLVE_* path-resolution flags. */
> > +};
> > +.EE
> > +.in
> > +.PP
> > +Any future extensions to
> > +.BR openat2 ()
> > +will be implemented as new fields appended to the above structure, with the
> > +zero value of the new fields acting as though the extension were not present.
> > +.PP
> > +The meaning of each field is as follows:
> > +.RS
> > +
> > +.I flags
> > +.RSall
> > +The file creation and status flags to use for this operation. All of the
> > +.B O_*
> > +flags defined for
> > +.BR openat (2)
> > +are valid
> > +.BR openat2 ()
> > +flag values.
> > +.RE
> > +
> > +.I upgrade_mask
> > +.RS
> > +Restrict with which
> > +.I access modes
> > +the returned
> > +.B O_PATH
> > +descriptor may be re-opened (either through
> > +.B O_EMPTYPATH
> > +or
> > +.IR /proc/self/fd/ .)
> > +This field may only be set to a non-zero value if
> > +.I flags
> > +contains
> > +.BR O_PATH .
> > +By default, an
> > +.B O_PATH
> > +file descriptor of an ordinary file may be re-opened with with any access mode (but an
> > +.B O_PATH
> > +file descriptor of a magic-link may only be re-opened with access modes that
> > +the original magic-link possessed). The full list of
> 
> magic link (throughout the page)
> 
> > +.I upgrade_mask
> > +flags is given below.
> > +.TP
> > +.B UPGRADE_NOREAD
> > +Do not permit the
> > +.B O_PATH
> > +file descriptor to be re-opened for reading (i.e.
> > +.BR O_RDONLY " or " O_RDWR .)
> > +.TP
> > +.B UPGRADE_NOWRITE
> > +Do not permit the
> > +.B O_PATH
> > +file descriptor to be re-opened for writing (i.e.
> > +.BR O_WRONLY ", " O_RDWR ", or " O_APPEND .)
> > +.RE
> > +.I resolve
> > +.RS
> > +Change how the components of
> > +.I pathname
> > +will be resolved (see
> > +.BR path_resolution (7)
> > +for background information.) The primary use-case for these flags is to allow
> 
> use case
> 
> > +trusted programs to restrict how un-trusted paths (or paths inside un-trusted
> 
> untrusted
> 
> > +directories) are resolved. The full list of
> > +.I resolve
> > +flags is given below.
> > +.TP
> > +.B RESOLVE_NO_XDEV
> > +Disallow all mount-point crossings during path resolution (including
> 
> I think better would be: "Disallow traversal of mount points". Do you 
> agree?

Yes, that sounds better.

> > +all bind-mounts).
> 
> bind mounts
> 
> > +
> > +Users of this flag are encouraged to make its use configurable (unless it is
> > +used for a specific security purpose), as bind-mounts are very widely used by
> > +end-users and thus enabling this flag globally may result in spurious errors on
> > +some systems.
> > +.TP
> > +.B RESOLVE_NO_SYMLINKS
> > +Disallow all symlink resolution during path resolution. If the trailing
> 
> Disallow resolution of symbolic links during path resolution
> 
> > +component is a symlink, and
> 
> symbolic link (throughout the page)
> 
> > +.I flags
> > +contains both
> > +.BR O_PATH " and " O_NOFOLLOW ","
> > +then an
> > +.B O_PATH
> > +file descriptor referencing the symlink will be returned. This option implies
> > +.BR RESOLVE_NO_MAGICLINKS .
> > +
> > +Users of this flag are encouraged to make its use configurable (unless it is
> > +used for a specific security purpose), as symlinks are very widely used by
> > +end-users and thus enabling this flag globally may result in spurious errors on
> > +some systems.
> 
> It's not really clear what you mean by "enabling this flag globally".
> Could you reword, or explain in a bit more detail?

A better word might be "indiscriminately" -- the point being that if
a program uses it for every openat2() call (and users cannot disable
it), then the program will break on all sorts of systems.

> > +.TP
> > +.B RESOLVE_NO_MAGICLINKS
> > +Disallow all magic-link resolution during path resolution. If the trailing
> > +component is a magic-link, and
> > +.I flags
> > +contains both
> > +.BR O_PATH " and " O_NOFOLLOW ","
> > +then an
> > +.B O_PATH
> > +file descriptor referencing the magic-link will be returned.
> > +
> > +Magic-links are symlink-like objects that are most notably found in
> > +.BR proc (5)
> > +(examples include
> > +.IR /proc/[pid]/exe " and " /proc/[pid]/fd/* .)
> > +Due to the potential danger of unknowingly opening these magic-links, it may be
> > +preferable for users to disable their resolution entirely (see
> > +.BR symlink (7)
> > +for more details.)
> > +.TP
> > +.B RESOLVE_BENEATH
> > +Do not permit the path resolution to succeed if any component of the resolution
> > +is not a descendant of the directory indicated by
> > +.IR dirfd .
> > +This results in absolute symlinks (and absolute values of
> > +.IR pathname )
> > +to be rejected. Magic-link resolution is also not permitted.
> 
> So, this flag implies RESOLVE_NO_MAGICLINKS? If yes,
> it would be good to state that more explicitly.

It does, though this might change in the future (some magic-link
resolutions might be safe -- but it's unclear what the semantics should
be). Users should explicitly set RESOLVE_NO_MAGICLINKS if they really
don't want to resolve them.

> > +
> > +.TP
> > +.B RESOLVE_IN_ROOT
> > +Temporarily treat
> > +.I dirfd
> > +as the root of the filesystem (as though the user called
> 
> Perhaps better:
> 
> Treat
> .I dirfd
> as the root directory while resolving
> .I pathname
> (as though...)

Yeah that sounds better.

> > +.BR chroot (2)
> > +with
> > +.IR dirfd
> > +as the argument.) Absolute symlinks and ".." path components will be scoped to
> > +.IR dirfd . Magic-link resolution is also not permitted.
> 
> Insert a newline before "Magic" to fix a formatting problem.
> 
> So, this flag implies RESOLVE_NO_MAGICLINKS? If yes,
> it would be good to state that more explicitly.

Same reply as above.

> > +
> > +However, unlike
> > +.BR chroot (2)
> > +(which changes the filesystem root persistently for an entire thread-group),
> 
> s/persistently for an entire thread-group/
>  /permanently for a process/
> 
> > +.B RESOLVE_IN_ROOT
> > +allows a program to efficiently restrict path resolution for only certain
> > +operations. It also has several hardening features (such as not permitting
> > +magic-link resolution) which
> > +.BR chroot (2)
> > +does not.
> > +.RE
> > +
> > +.RE
> > +
> > +.PP
> > +Unlike
> > +.BR openat (2),
> > +any unknown flags set in fields of
> > +.I how
> > +will result in an error, rather than being ignored. 
> 
> Thank you, thank you, thank you. It was sad
> that openat() never fixed that antifeature.

No problem, it's bothered me for a long time as well. :D

> > In addition, an error will
> > +be returned if the value of the
> > +.IR mode " and " upgrade_mask
> > +union is non-zero unless:
> > +.RS
> > +.IP * 3
> > +.I flags
> > +indicates that a new file will be created (it contains
> > +.BR O_CREAT " or " O_TMPFILE ),
> > +in which case
> > +.I mode
> > +may be any valid file mode.
> > +.IP *
> > +.I flags
> > +contains
> > +.BR O_PATH ,
> > +in which case
> > +.I upgrade_mask
> > +must only contain valid
> > +.B UPGRADE_*
> > +flags.
> > +.RE
> > +
> > +.SH RETURN VALUE
> > +On success, a new file descriptor is returned. On error, -1 is returned, and
> > +.I errno
> > +is set appropriately.
> > +
> > +.SH ERRORS
> > +The set of errors returned by
> > +.BR openat2 ()
> > +includes all of the errors returned by
> > +.BR openat (2),
> > +as well as the following additional errors:
> > +.TP
> > +.B EINVAL
> > +An unknown flag or invalid value was specified in
> > +.IR how .
> > +.TP
> > +.B EINVAL
> > +.I size
> > +was smaller than any known version of
> > +.IR "struct open_how" .
> > +.TP
> > +.B E2BIG
> > +An extension was specified in
> > +.IR how ,
> > +which the current kernel does not support (see the "Extensibility" section of
> > +the \fBNOTES\fP for more detail on how extensions are handled.)
> > +.TP
> > +.B EAGAIN
> > +.I resolve
> > +contains either
> > +.BR RESOLVE_IN_ROOT " or " RESOLVE_BENEATH ,
> > +and the kernel could not ensure that a ".." component didn't escape (due to a
> > +race condition or potential attack). Callers may choose to retry the
> > +.BR openat2 ()
> > +call.
> > +.TP
> > +.B EXDEV
> > +.I resolve
> > +contains either
> > +.BR RESOLVE_IN_ROOT " or " RESOLVE_BENEATH ,
> > +and a path component attempted to escape the root of the resolution.
> > +
> > +.TP
> > +.B EXDEV
> > +.I resolve
> > +contains
> > +.BR RESOLVE_NO_XDEV ,
> > +and a path component attempted to cross a mount-point.
> 
> mount point
> 
> > +
> > +.TP
> > +.B ELOOP
> > +.I resolve
> > +contains
> > +.BR RESOLVE_NO_SYMLINKS ,
> > +and one of the path components was a symlink.
> > +.TP
> > +.B ELOOP
> > +.I resolve
> > +contains
> > +.BR RESOLVE_NO_MAGICLINKS ,
> > +and one of the path components was a magic-link.
> > +
> > +.SH VERSIONS
> > +.BR openat2 ()
> > +was added to Linux in kernel 5.FOO.
> > +
> > +.SH CONFORMING TO
> > +This system call is Linux-specific.
> > +
> > +The semantics of
> > +.B RESOLVE_BENEATH
> > +were modelled after FreeBSD's
> > +.BR O_BENEATH .
> > +
> > +.SH NOTES
> > +Glibc does not provide a wrapper for this system call; call it using
> > +.BR syscall (2).
> > +
> > +.SS Extensibility
> > +In order to allow for
> > +.I struct open_how
> > +to be extended in future kernel revisions,
> > +.BR openat2 ()
> > +requires userspace to specify what sized
> 
> s/what sized/the size of/
> 
> > +.I struct open_how
> > +structure they are passing. By providing this information, it is possible for
> > +.BR openat2 ()
> > +to provide both forwards- and backwards-compatibility \(em with
> > +.I size
> > +acting as an implicit version number (because new extension fields will always
> > +be appended, the size will always increase.) This extensibility design is very
> > +similar to other system calls such as
> > +.BR perf_setattr "(2), " perf_event_open "(2), and " clone (3).
> 
> The following explanation of usize and ksize is great. Thanks for that.

Glad to hear you don't think it's too much fluff. :D

> > +If we let
> > +.I usize
> > +be the size of the structure according to userspace and
> > +.I ksize
> > +be the size of the structure which the kernel supports, then there are only
> > +three cases to consider:
> > +
> > +.RS
> > +.IP * 3
> > +If
> > +.IR ksize " equals " usize ,
> > +then there is no version mismatch and
> > +.I how
> > +can be used verbatim.
> > +.IP *
> > +If
> > +.IR ksize " is larger than " usize ,
> > +then there are some extensions the kernel supports which the userspace program
> > +is unaware of. Because all extensions must have their zero values be a no-op,
> > +the kernel treats all of the extension fields not set by userspace to have zero
> > +values. This provides backwards-compatibility.
> > +.IP *
> > +If
> > +.IR ksize " is smaller than " usize ,
> > +then there are some extensions which the userspace program is aware of but the
> > +kernel does not support. Because all extensions must have their zero values be
> > +a no-op, the kernel can safely ignore the unsupported extension fields if they
> > +are all-zero. If any unsupported extension fields are non-zero, then an error
> > +is returned. This provides forwards-compatibility.
> > +.RE
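
The three cases above can be modelled in userspace roughly as follows. This
is a sketch, not the kernel's implementation: copy_versioned_struct is a
hypothetical helper, and returning -E2BIG for non-zero unknown fields is an
assumption (the text above only says that an error is returned).

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy a usize-byte structure (as userspace sees it) into a ksize-byte
 * buffer (as the kernel sees it).
 * Returns 0 on success, or -E2BIG if userspace set a field that the
 * kernel side does not know about.
 */
static int copy_versioned_struct(void *dst, size_t ksize,
                                 const void *src, size_t usize)
{
    const unsigned char *in = src;
    size_t common = usize < ksize ? usize : ksize;

    /* ksize < usize: the unknown trailing bytes must all be zero. */
    for (size_t i = common; i < usize; i++) {
        if (in[i] != 0)
            return -E2BIG;
    }

    /* ksize > usize: fields userspace did not provide read as zero. */
    memset(dst, 0, ksize);

    /* ksize == usize: this copies the structure verbatim. */
    memcpy(dst, in, common);
    return 0;
}
```

The zero-fill plus bounded copy is what lets an old program run on a new
kernel, and the trailing-zero check is what lets a new program run on an old
kernel as long as it does not actually use the newer fields.
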
> > +
> > +Therefore, most userspace programs will not need to have any special handling
> > +of extensions. However, if a userspace program wishes to determine what
> > +extensions the running kernel supports, they may conduct a binary search on
> > +.IR size
> > +(to find the largest value which doesn't produce an error.)
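
The binary search mentioned above could look something like the sketch
below. The probe callback is hypothetical: it stands for "call openat2()
with a structure of this size and report whether it was accepted". Given
the third case above, a real probe would presumably need to fill the bytes
beyond the base structure with non-zero values, since an all-zero tail is
accepted at any size. fake_kernel_probe is only a stand-in so that the
search can be exercised without a kernel.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical probe: true if a structure of this size is accepted. */
typedef bool (*size_probe_fn)(size_t size, void *ctx);

/*
 * Find the largest size in [lo, hi] that the probe accepts, assuming that
 * lo is accepted and that acceptance is monotonic (once a size is rejected,
 * all larger sizes are rejected too).
 */
static size_t largest_accepted_size(size_t lo, size_t hi,
                                    size_probe_fn probe, void *ctx)
{
    while (lo < hi) {
        /* Round up so that the loop always makes progress. */
        size_t mid = lo + (hi - lo + 1) / 2;

        if (probe(mid, ctx))
            lo = mid;      /* mid works; the answer is mid or larger */
        else
            hi = mid - 1;  /* mid fails; the answer is below mid */
    }
    return lo;
}

/* Stand-in for a kernel whose structure size is *ctx bytes. */
static bool fake_kernel_probe(size_t size, void *ctx)
{
    return size <= *(const size_t *)ctx;
}
```

With a pretend kernel size of 24, largest_accepted_size(8, 4096,
fake_kernel_probe, &ksize) converges on 24 after about a dozen probes.
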
> > +
> > +.SH SEE ALSO
> > +.BR openat (2),
> > +.BR path_resolution (7),
> > +.BR symlink (7)
> > diff --git a/man7/path_resolution.7 b/man7/path_resolution.7
> > index 85dd354e9a93..3da3e5b614c8 100644
> > --- a/man7/path_resolution.7
> > +++ b/man7/path_resolution.7
> > @@ -29,17 +29,17 @@ path_resolution \- how a pathname is resolved to a file
> >  Some UNIX/Linux system calls have as parameter one or more filenames.
> >  A filename (or pathname) is resolved as follows.
> >  .SS Step 1: start of the resolution process
> > -If the pathname starts with the \(aq/\(aq character,
> > -the starting lookup directory
> > -is the root directory of the calling process.
> > -(A process inherits its
> > -root directory from its parent.
> > -Usually this will be the root directory
> > -of the file hierarchy.
> > -A process may get a different root directory
> > -by use of the
> > +If the pathname starts with the \(aq/\(aq character, the starting lookup
> > +directory is the root directory of the calling process. (A process inherits its
> > +root directory from its parent. Usually this will be the root directory of the
> > +file hierarchy. A process may get a different root directory by use of the
> >  .BR chroot (2)
> > -system call.
> > +system call, or may temporarily use a different root directory by using
> > +.BR openat2 (2)
> > +with the
> > +.B RESOLVE_IN_ROOT
> > +flag set.
> > +.PP
> >  A process may get an entirely private mount namespace in case
> >  it\(emor one of its ancestors\(emwas started by an invocation of the
> >  .BR clone (2)
> > @@ -48,16 +48,24 @@ system call that had the
> >  flag set.)
> >  This handles the \(aq/\(aq part of the pathname.
> >  .PP
> > -If the pathname does not start with the \(aq/\(aq character, the
> > -starting lookup directory of the resolution process is the current working
> > -directory of the process.
> > -(This is also inherited from the parent.
> > -It can be changed by use of the
> > +If the pathname does not start with the \(aq/\(aq character, the starting
> > +lookup directory of the resolution process is the current working directory of
> > +the process \(em or in the case of
> > +.BR openat (2)-style
> > +syscalls, the
> 
> system calls
> 
> > +.I dfd
> > +argument (or the current working directory if
> > +.B AT_FDCWD
> > +is passed as the
> > +.I dfd
> > +argumnet). The current working directory is inherited from the parent, and can
> 
> argument
> 
> > +be changed by use of the
> >  .BR chdir (2)
> > -system call.)
> > +syscall.
> 
> "system call" please.
> 
> >  .PP
> >  Pathnames starting with a \(aq/\(aq character are called absolute pathnames.
> >  Pathnames not starting with a \(aq/\(aq are called relative pathnames.
> > +
> 
> No blank line here.
> 
> >  .SS Step 2: walk along the path
> >  Set the current lookup directory to the starting lookup directory.
> >  Now, for each nonfinal component of the pathname, where a component
> > @@ -124,6 +132,13 @@ the kernel's pathname-resolution code
> >  was reworked to eliminate the use of recursion,
> >  so that the only limit that remains is the maximum of 40
> >  resolutions for the entire pathname.
> > +.PP
> > +The resolution of syscalls during this stage can be blocked by using
> 
> "resolution of syscall" seems wrong? "syscall" should be something 
> else?

Yeah, should be "resolution of symlinks". ;)

> > +.BR openat2 (2),
> > +with the
> > +.B RESOLVE_NO_SYMLINKS
> > +flag set.
> > +
> >  .SS Step 3: find the final entry
> >  The lookup of the final component of the pathname goes just like
> >  that of all other components, as described in the previous step,
> > @@ -160,7 +175,8 @@ The path resolution process will assume that these entries have
> >  their conventional meanings, regardless of whether they are
> >  actually present in the physical filesystem.
> >  .PP
> > -One cannot walk down past the root: "/.." is the same as "/".
> > +One cannot walk up past the root: "/.." is the same as "/".
> > +
> 
> No blank line please.
> 
> >  .SS Mount points
> >  After a "mount dev path" command, the pathname "path" refers to
> >  the root of the filesystem hierarchy on the device "dev", and no
> > @@ -169,6 +185,13 @@ longer to whatever it referred to earlier.
> >  One can walk out of a mounted filesystem: "path/.." refers to
> >  the parent directory of "path",
> >  outside of the filesystem hierarchy on "dev".
> > +.PP
> > +Mount-point crossings can be blocked by using
> 
> Traversal of mount points can be disallowed by...
> 
> > +.BR openat2 (2),
> > +with the
> > +.B RESOLVE_NO_XDEV
> > +flag set (though note that this also restricts bind-mount crossings).
> > +
> 
> No blank line please.
> 
> >  .SS Trailing slashes
> >  If a pathname ends in a \(aq/\(aq, that forces resolution of the preceding
> >  component as in Step 2: it has to exist and resolve to a directory.
> > 

Thanks so much, and I'll clean up your nits.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>


* Re: [PATCH RFC 3/3] openat2.2: document new openat2(2) syscall
  2019-10-09 10:17     ` Aleksa Sarai
@ 2019-10-09 10:32       ` Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 18+ messages in thread
From: Michael Kerrisk (man-pages) @ 2019-10-09 10:32 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: mtk.manpages, Al Viro, Christian Brauner, Aleksa Sarai,
	linux-man, linux-api, linux-kernel

Hello Aleksa,

On 10/9/19 12:17 PM, Aleksa Sarai wrote:
> On 2019-10-09, Michael Kerrisk (man-pages) <mtk.manpages@gmail.com> wrote:
>> Hello Aleksa,
>>
>> Thanks for this. It's a great piece of documentation work!
>>
>> I would prefer the path_resolution(7) piece as a separate patch.
> 
> Thanks, and will do.
> 
>> On 10/3/19 4:55 PM, Aleksa Sarai wrote:
>>> Rather than trying to merge the new syscall documentation into open.2
>>> (which would probably result in the man-page being incomprehensible),
>>> instead the new syscall gets its own dedicated page with links between
>>> open(2) and openat2(2) to avoid duplicating information such as the list
>>> of O_* flags or common errors.
>>
>> Yes, looking at the size of the proposed openat2(2) page,
>> this seems best.
>>>
>>> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
>>> ---

[...]

>>> diff --git a/man2/openat2.2 b/man2/openat2.2
>>> new file mode 100644
>>> index 000000000000..c43c76046243
>>> --- /dev/null
>>> +++ b/man2/openat2.2

[...]

>>> +.TP
>>> +.B RESOLVE_NO_SYMLINKS
>>> +Disallow all symlink resolution during path resolution. If the trailing
>>
>> Disallow resolution of symbolic links during path resolution
>>
>>> +component is a symlink, and
>>
>> symbolic link (throughout the page)
>>
>>> +.I flags
>>> +contains both
>>> +.BR O_PATH " and " O_NOFOLLOW ","
>>> +then an
>>> +.B O_PATH
>>> +file descriptor referencing the symlink will be returned. This option implies
>>> +.BR RESOLVE_NO_MAGICLINKS .
>>> +
>>> +Users of this flag are encouraged to make its use configurable (unless it is
>>> +used for a specific security purpose), as symlinks are very widely used by
>>> +end-users and thus enabling this flag globally may result in spurious errors on
>>> +some systems.
>>
>> It's not really clear what you mean by "enabling this flag globally".
>> Could you reword, or explain in a bit more detail?
> 
> A better word might be "indiscriminately" -- the point being that if
> a program uses it for every openat2() call (and users cannot disable
> it), then the program will break on all sorts of systems.

Okay -- could you please amend the text to say something more like what
you just clarified.

> 
>>> +.TP
>>> +.B RESOLVE_NO_MAGICLINKS
>>> +Disallow all magic-link resolution during path resolution. If the trailing
>>> +component is a magic-link, and
>>> +.I flags
>>> +contains both
>>> +.BR O_PATH " and " O_NOFOLLOW ","
>>> +then an
>>> +.B O_PATH
>>> +file descriptor referencing the magic-link will be returned.
>>> +
>>> +Magic-links are symlink-like objects that are most notably found in
>>> +.BR proc (5)
>>> +(examples include
>>> +.IR /proc/[pid]/exe " and " /proc/[pid]/fd/* .)
>>> +Due to the potential danger of unknowingly opening these magic-links, it may be
>>> +preferable for users to disable their resolution entirely (see
>>> +.BR symlink (7)
>>> +for more details.)
>>> +.TP
>>> +.B RESOLVE_BENEATH
>>> +Do not permit the path resolution to succeed if any component of the resolution
>>> +is not a descendant of the directory indicated by
>>> +.IR dirfd .
>>> +This results in absolute symlinks (and absolute values of
>>> +.IR pathname )
>>> +to be rejected. Magic-link resolution is also not permitted.
>>
>> So, this flag implies RESOLVE_NO_MAGICLINKS? If yes,
>> it would be good to state that more explicitly.
> 
> It does, though this might change in the future (some magic-link
> resolutions might be safe -- but it's unclear what the semantics should
> be). Users should explicitly set RESOLVE_NO_MAGICLINKS if they really
> don't want to resolve them.

Okay -- I understand. Perhaps you could then at least say something like:

Currently, this flag also disables magic-link resolution. However, this
may change in the future. The caller should explicitly specify
RESOLVE_NO_MAGICLINKS to ensure that magic links are not resolved.

>>> +
>>> +.TP
>>> +.B RESOLVE_IN_ROOT
>>> +Temporarily treat
>>> +.I dirfd
>>> +as the root of the filesystem (as though the user called
>>
>> Perhaps better:
>>
>> Treat
>> .I dirfd
>> as the root directory while resolving
>> .I pathname
>> (as though...)
> 
> Yeah that sounds better.
> 
>>> +.BR chroot (2)
>>> +with
>>> +.IR dirfd
>>> +as the argument.) Absolute symlinks and ".." path components will be scoped to
>>> +.IR dirfd . Magic-link resolution is also not permitted.
>>
>> Insert a newline before "Magic" to fix a formatting problem.
>>
>> So, this flag implies RESOLVE_NO_MAGICLINKS? If yes,
>> it would be good to state that more explicitly.
> 
> Same reply as above.

See above :-)

[...]

Thanks,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


* Re: [PATCH RFC 1/3] symlink.7: document magic-links more completely
  2019-10-03 14:55 ` [PATCH RFC 1/3] symlink.7: document magic-links more completely Aleksa Sarai
  2019-10-07 16:36   ` Jann Horn
  2019-10-09  7:55   ` Michael Kerrisk (man-pages)
@ 2020-04-17 11:39   ` Michael Kerrisk (man-pages)
  2020-04-17 15:37     ` Aleksa Sarai
  2 siblings, 1 reply; 18+ messages in thread
From: Michael Kerrisk (man-pages) @ 2020-04-17 11:39 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Al Viro, Christian Brauner, Aleksa Sarai, linux-man, Linux API, lkml

Hi Aleksa,

Re our discussion of documentation to be added for magic symlinks,
there was the patch below, which got paused. I guess this just needs a
light refresh?

Thanks,

Michael

On Thu, 3 Oct 2019 at 16:56, Aleksa Sarai <cyphar@cyphar.com> wrote:
>
> Traditionally, magic-links have not been a well-understood topic in
> Linux. Given the new changes in their semantics (related to the link
> mode of trailing magic-links), it seems like a good opportunity to shine
> more light on magic-links and their semantics.
>
> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
> ---
>  man7/path_resolution.7 | 15 +++++++++++++++
>  man7/symlink.7         | 39 ++++++++++++++++++++++++++++++---------
>  2 files changed, 45 insertions(+), 9 deletions(-)
>
> diff --git a/man7/path_resolution.7 b/man7/path_resolution.7
> index 07664ed8faec..46f25ec4cdfa 100644
> --- a/man7/path_resolution.7
> +++ b/man7/path_resolution.7
> @@ -136,6 +136,21 @@ we are just creating it.
>  The details on the treatment
>  of the final entry are described in the manual pages of the specific
>  system calls.
> +.PP
> +Since Linux 5.FOO, if the final entry is a "magic-link" (see
> +.BR symlink (7)),
> +and the user is attempting to
> +.BR open (2)
> +it, then there is an additional permission-related restriction applied to the
> +operation: the requested access mode must not exceed the "link mode" of the
> +magic-link (unlike ordinary symlinks, magic-links have their own file mode.)
> +For example, if
> +.I /proc/[pid]/fd/[num]
> +has a link mode of
> +.BR 0500 ,
> +unprivileged users are not permitted to
> +.BR open ()
> +the magic-link for writing.
>  .SS . and ..
>  By convention, every directory has the entries "." and "..",
>  which refer to the directory itself and to its parent directory,
> diff --git a/man7/symlink.7 b/man7/symlink.7
> index 9f5bddd5dc21..33f0ec703acd 100644
> --- a/man7/symlink.7
> +++ b/man7/symlink.7
> @@ -84,6 +84,25 @@ as they are implemented on Linux and other systems,
>  are outlined here.
>  It is important that site-local applications also conform to these rules,
>  so that the user interface can be as consistent as possible.
> +.SS Magic-links
> +There is a special class of symlink-like objects known as "magic-links" which
> +can be found in certain pseudo-filesystems such as
> +.BR proc (5)
> +(examples include
> +.IR /proc/[pid]/exe " and " /proc/[pid]/fd/* .)
> +Unlike normal symlinks, magic-links are not resolved through
> +pathname-expansion, but instead act as direct references to the kernel's own
> +representation of a file handle. As such, these magic-links allow users to
> +access files which cannot be referenced with normal paths (such as unlinked
> +files still referenced by a running program.)
> +.PP
> +Because they can bypass ordinary
> +.BR mount_namespaces (7)-based
> +restrictions, magic-links have been used as attack vectors in various exploits.
> +As such (since Linux 5.FOO), there are additional restrictions placed on the
> +re-opening of magic-links (see
> +.BR path_resolution (7)
> +for more details.)
>  .SS Symbolic link ownership, permissions, and timestamps
>  The owner and group of an existing symbolic link can be changed
>  using
> @@ -99,16 +118,18 @@ of a symbolic link can be changed using
>  or
>  .BR lutimes (3).
>  .PP
> -On Linux, the permissions of a symbolic link are not used
> -in any operations; the permissions are always
> -0777 (read, write, and execute for all user categories),
>  .\" Linux does not currently implement an lchmod(2).
> -and can't be changed.
> -(Note that there are some "magic" symbolic links in the
> -.I /proc
> -directory tree\(emfor example, the
> -.IR /proc/[pid]/fd/*
> -files\(emthat have different permissions.)
> +On Linux, the permissions of an ordinary symbolic link are not used in any
> +operations; the permissions are always 0777 (read, write, and execute for all
> +user categories), and can't be changed.
> +.PP
> +However, magic-links do not follow this rule. They can have a non-0777 mode,
> +which is used for permission checks when the final
> +component of an
> +.BR open (2)'s
> +path is a magic-link (see
> +.BR path_resolution (7).)
> +
>  .\"
>  .\" The
>  .\" 4.4BSD
> --
> 2.23.0
>


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


* Re: [PATCH RFC 1/3] symlink.7: document magic-links more completely
  2020-04-17 11:39   ` Michael Kerrisk (man-pages)
@ 2020-04-17 15:37     ` Aleksa Sarai
  2020-05-15 11:42       ` Michael Kerrisk (man-pages)
  0 siblings, 1 reply; 18+ messages in thread
From: Aleksa Sarai @ 2020-04-17 15:37 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Al Viro, Christian Brauner, Aleksa Sarai, linux-man, Linux API, lkml

On 2020-04-17, Michael Kerrisk (man-pages) <mtk.manpages@gmail.com> wrote:
> Hi Aleksa,
> 
> Re our discussion of documentation to be added for magic symlinks,
> there was the patch below, which got paused. I guess this just needs a
> light refresh?

Yes, this is the patch I was thinking of -- but since the whole "magic
link mode" semantics weren't in the openat2() series that was merged,
this would need a refresh. Also I feel that magic-links probably deserve
a slightly longer explanation than I gave here.



-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>


* Re: [PATCH RFC 1/3] symlink.7: document magic-links more completely
  2020-04-17 15:37     ` Aleksa Sarai
@ 2020-05-15 11:42       ` Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 18+ messages in thread
From: Michael Kerrisk (man-pages) @ 2020-05-15 11:42 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Al Viro, Christian Brauner, Aleksa Sarai, linux-man, Linux API, lkml

Hi Aleksa,

Did you have a chance to look into writing this patch?

Thanks,

Michael

On Fri, 17 Apr 2020 at 17:37, Aleksa Sarai <cyphar@cyphar.com> wrote:
>
> On 2020-04-17, Michael Kerrisk (man-pages) <mtk.manpages@gmail.com> wrote:
> > Hi Aleksa,
> >
> > Re our discussion of documentation to be added for magic symlinks,
> > there was the patch below, which got paused. I guess this just needs a
> > light refresh?
>
> Yes, this is the patch I was thinking of -- but since the whole "magic
> link mode" semantics weren't in the openat2() series that was merged,
> this would need a refresh. Also I feel that magic-links probably deserve
> a slightly longer explanation than I gave here.



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Thread overview: 18+ messages
2019-10-03 14:55 [PATCH RFC 0/3] document openat2(2) patch series Aleksa Sarai
2019-10-03 14:55 ` [PATCH RFC 1/3] symlink.7: document magic-links more completely Aleksa Sarai
2019-10-07 16:36   ` Jann Horn
2019-10-08  1:33     ` Aleksa Sarai
2019-10-09  7:55   ` Michael Kerrisk (man-pages)
2019-10-09  9:57     ` Aleksa Sarai
2020-04-17 11:39   ` Michael Kerrisk (man-pages)
2020-04-17 15:37     ` Aleksa Sarai
2020-05-15 11:42       ` Michael Kerrisk (man-pages)
2019-10-03 14:55 ` [PATCH RFC 2/3] open.2: add O_EMPTYPATH documentation Aleksa Sarai
2019-10-09  8:01   ` Michael Kerrisk (man-pages)
2019-10-09 10:00     ` Aleksa Sarai
2019-10-03 14:55 ` [PATCH RFC 3/3] openat2.2: document new openat2(2) syscall Aleksa Sarai
2019-10-09  8:36   ` Michael Kerrisk (man-pages)
2019-10-09 10:17     ` Aleksa Sarai
2019-10-09 10:32       ` Michael Kerrisk (man-pages)
2019-10-03 14:55 ` [PATCH RFC 3/3] openat2.2: document new syscall Aleksa Sarai
2019-10-03 15:00   ` Aleksa Sarai
