Ignore this one (it's an older version of the openat2.2 patch) -- I sent it by accident.

On 2019-10-04, Aleksa Sarai wrote:
> Signed-off-by: Aleksa Sarai
> ---
> man2/open.2 | 5 +
> man2/openat2.2 | 381 +++++++++++++++++++++++++++++++++++++++++
> man7/path_resolution.7 | 57 ++++--
> 3 files changed, 426 insertions(+), 17 deletions(-)
> create mode 100644 man2/openat2.2
>
> diff --git a/man2/open.2 b/man2/open.2
> index 7217fe056e5e..a0b43394bbee 100644
> --- a/man2/open.2
> +++ b/man2/open.2
> @@ -65,6 +65,10 @@ open, openat, creat \- open and possibly create a file
> .BI "int openat(int " dirfd ", const char *" pathname ", int " flags );
> .BI "int openat(int " dirfd ", const char *" pathname ", int " flags \
> ", mode_t " mode );
> +.PP
> +/* Documented separately, in \fBopenat2\fP(2). */
> +.BI "int openat2(int " dirfd ", const char *" pathname ", \
> +const struct open_how *" how ", size_t " size ");
> .fi
> .PP
> .in -4n
> @@ -1808,6 +1812,7 @@ will create a regular file (i.e.,
> .B O_DIRECTORY
> is ignored).
> .SH SEE ALSO
> +.BR openat2 (2),
> .BR chmod (2),
> .BR chown (2),
> .BR close (2),
> diff --git a/man2/openat2.2 b/man2/openat2.2
> new file mode 100644
> index 000000000000..c43c76046243
> --- /dev/null
> +++ b/man2/openat2.2
> @@ -0,0 +1,381 @@
> +.\" Copyright (C) 2019 Aleksa Sarai
> +.\"
> +.\" %%%LICENSE_START(VERBATIM)
> +.\" Permission is granted to make and distribute verbatim copies of this
> +.\" manual provided the copyright notice and this permission notice are
> +.\" preserved on all copies.
> +.\"
> +.\" Permission is granted to copy and distribute modified versions of this
> +.\" manual under the conditions for verbatim copying, provided that the
> +.\" entire resulting derived work is distributed under the terms of a
> +.\" permission notice identical to this one.
> +.\"
> +.\" Since the Linux kernel and libraries are constantly changing, this
> +.\" manual page may be incorrect or out-of-date. The author(s) assume no
> +.\" responsibility for errors or omissions, or for damages resulting from
> +.\" the use of the information contained herein. The author(s) may not
> +.\" have taken the same level of care in the production of this manual,
> +.\" which is licensed free of charge, as they might when working
> +.\" professionally.
> +.\"
> +.\" Formatted or processed versions of this manual, if unaccompanied by
> +.\" the source, must acknowledge the copyright and authors of this work.
> +.\" %%%LICENSE_END
> +.TH OPENAT2 2 2019-10-03 "Linux" "Linux Programmer's Manual"
> +.SH NAME
> +openat2 \- open and possibly create a file (extended)
> +.SH SYNOPSIS
> +.nf
> +.B #include
> +.B #include
> +.B #include
> +.PP
> +.BI "int openat2(int " dirfd ", const char *" pathname ", \
> +const struct open_how *" how ", size_t " size ");
> +.fi
> +.PP
> +.IR Note :
> +There is no glibc wrapper for this system call; see NOTES.
> +.SH DESCRIPTION
> +The
> +.BR openat2 ()
> +system call is an extension of
> +.BR openat (2)
> +and provides a superset of its functionality. Rather than taking a single
> +.I flag
> +argument, an extensible structure (\fIhow\fP) is passed instead to allow for
> +seamless future extensions.
> +.PP
> +.I size
> +must be set to
> +.IR "sizeof(struct open_how)" ,
> +to facilitate future extensions (see the "Extensibility" section of the
> +\fBNOTES\fP for more detail on how extensions are handled.)
> +
> +.SS The open_how structure
> +The following structure indicates how
> +.I pathname
> +should be opened, and acts as a superset of the
> +.IR flag " and " mode
> +arguments to
> +.BR openat (2).
> +.PP
> +.in +4n
> +.EX
> +struct open_how {
> +    uint32_t flags; /* open(2)-style O_* flags. */
> +    union {
> +        uint16_t mode; /* File mode bits for new file creation. */
> +        uint16_t upgrade_mask; /* Restrict how O_PATHs may be re-opened. */
> +    };
> +    uint32_t resolve; /* RESOLVE_* path-resolution flags. */
> +};
> +.EE
> +.in
> +.PP
> +Any future extensions to
> +.BR openat2 ()
> +will be implemented as new fields appended to the above structure, with the
> +zero value of the new fields acting as though the extension were not present.
> +.PP
> +The meaning of each field is as follows:
> +.RS
> +
> +.I flags
> +.RS
> +The file creation and status flags to use for this operation. All of the
> +.B O_*
> +flags defined for
> +.BR openat (2)
> +are valid
> +.BR openat2 ()
> +flag values.
> +.RE
> +
> +.I upgrade_mask
> +.RS
> +Restrict with which
> +.I access modes
> +the returned
> +.B O_PATH
> +descriptor may be re-opened (either through
> +.B O_EMPTYPATH
> +or
> +.IR /proc/self/fd/ .)
> +This field may only be set to a non-zero value if
> +.I flags
> +contains
> +.BR O_PATH .
> +By default, an
> +.B O_PATH
> +file descriptor of an ordinary file may be re-opened with any access mode (but an
> +.B O_PATH
> +file descriptor of a magic-link may only be re-opened with access modes that
> +the original magic-link possessed). The full list of
> +.I upgrade_mask
> +flags is given below.
> +.TP
> +.B UPGRADE_NOREAD
> +Do not permit the
> +.B O_PATH
> +file descriptor to be re-opened for reading (i.e.
> +.BR O_RDONLY " or " O_RDWR .)
> +.TP
> +.B UPGRADE_NOWRITE
> +Do not permit the
> +.B O_PATH
> +file descriptor to be re-opened for writing (i.e.
> +.BR O_WRONLY ", " O_RDWR ", or " O_APPEND .)
> +.RE
> +
> +.I resolve
> +.RS
> +Change how the components of
> +.I pathname
> +will be resolved (see
> +.BR path_resolution (7)
> +for background information.) The primary use-case for these flags is to allow
> +trusted programs to restrict how un-trusted paths (or paths inside un-trusted
> +directories) are resolved. The full list of
> +.I resolve
> +flags is given below.
> +.TP
> +.B RESOLVE_NO_XDEV
> +Disallow all mount-point crossings during path resolution (including
> +all bind-mounts).
> +
> +Users of this flag are encouraged to make its use configurable (unless it is
> +used for a specific security purpose), as bind-mounts are very widely used by
> +end-users and thus enabling this flag globally may result in spurious errors on
> +some systems.
> +.TP
> +.B RESOLVE_NO_SYMLINKS
> +Disallow all symlink resolution during path resolution. If the trailing
> +component is a symlink, and
> +.I flags
> +contains both
> +.BR O_PATH " and " O_NOFOLLOW ","
> +then an
> +.B O_PATH
> +file descriptor referencing the symlink will be returned. This option implies
> +.BR RESOLVE_NO_MAGICLINKS .
> +
> +Users of this flag are encouraged to make its use configurable (unless it is
> +used for a specific security purpose), as symlinks are very widely used by
> +end-users and thus enabling this flag globally may result in spurious errors on
> +some systems.
> +.TP
> +.B RESOLVE_NO_MAGICLINKS
> +Disallow all magic-link resolution during path resolution. If the trailing
> +component is a magic-link, and
> +.I flags
> +contains both
> +.BR O_PATH " and " O_NOFOLLOW ","
> +then an
> +.B O_PATH
> +file descriptor referencing the magic-link will be returned.
> +
> +Magic-links are symlink-like objects that are most notably found in
> +.BR proc (5)
> +(examples include
> +.IR /proc/[pid]/exe " and " /proc/[pid]/fd/* .)
> +Due to the potential danger of unknowingly opening these magic-links, it may be
> +preferable for users to disable their resolution entirely (see
> +.BR symlink (7)
> +for more details.)
> +.TP
> +.B RESOLVE_BENEATH
> +Do not permit the path resolution to succeed if any component of the resolution
> +is not a descendant of the directory indicated by
> +.IR dirfd .
> +This results in absolute symlinks (and absolute values of
> +.IR pathname )
> +being rejected. Magic-link resolution is also not permitted.
> +
> +.TP
> +.B RESOLVE_IN_ROOT
> +Temporarily treat
> +.I dirfd
> +as the root of the filesystem (as though the user called
> +.BR chroot (2)
> +with
> +.IR dirfd
> +as the argument.) Absolute symlinks and ".." path components will be scoped to
> +.IR dirfd . Magic-link resolution is also not permitted.
> +
> +However, unlike
> +.BR chroot (2)
> +(which changes the filesystem root persistently for an entire thread-group),
> +.B RESOLVE_IN_ROOT
> +allows a program to efficiently restrict path resolution for only certain
> +operations. It also has several hardening features (such as not permitting
> +magic-link resolution) which
> +.BR chroot (2)
> +does not.
> +.RE
> +
> +.RE
> +
> +.PP
> +Unlike
> +.BR openat (2),
> +any unknown flags set in fields of
> +.I how
> +will result in an error, rather than being ignored. In addition, an error will
> +be returned if the value of the
> +.IR mode " and " upgrade_mask
> +union is non-zero unless:
> +.RS
> +.IP * 3
> +.I flags
> +indicates that a new file will be created (it contains
> +.BR O_CREAT " or " O_TMPFILE ),
> +in which case
> +.I mode
> +may be any valid file mode.
> +.IP *
> +.I flags
> +contains
> +.BR O_PATH ,
> +in which case
> +.I upgrade_mask
> +must only contain valid
> +.B UPGRADE_*
> +flags.
> +.RE
> +
> +.SH RETURN VALUE
> +On success, a new file descriptor is returned. On error, -1 is returned, and
> +.I errno
> +is set appropriately.
> +
> +.SH ERRORS
> +The set of errors returned by
> +.BR openat2 ()
> +includes all of the errors returned by
> +.BR openat (2),
> +as well as the following additional errors:
> +.TP
> +.B EINVAL
> +An unknown flag or invalid value was specified in
> +.IR how .
> +.TP
> +.B EINVAL
> +.I size
> +was smaller than any known version of
> +.IR "struct open_how" .
> +.TP
> +.B E2BIG
> +An extension was specified in
> +.IR how ,
> +which the current kernel does not support (see the "Extensibility" section of
> +the \fBNOTES\fP for more detail on how extensions are handled.)
> +.TP
> +.B EAGAIN
> +.I resolve
> +contains either
> +.BR RESOLVE_IN_ROOT " or " RESOLVE_BENEATH ,
> +and the kernel could not ensure that a ".." component didn't escape (due to a
> +race condition or potential attack). Callers may choose to retry the
> +.BR openat2 ()
> +call.
> +.TP
> +.B EXDEV
> +.I resolve
> +contains either
> +.BR RESOLVE_IN_ROOT " or " RESOLVE_BENEATH ,
> +and a path component attempted to escape the root of the resolution.
> +
> +.TP
> +.B EXDEV
> +.I resolve
> +contains
> +.BR RESOLVE_NO_XDEV ,
> +and a path component attempted to cross a mount-point.
> +
> +.TP
> +.B ELOOP
> +.I resolve
> +contains
> +.BR RESOLVE_NO_SYMLINKS ,
> +and one of the path components was a symlink.
> +.TP
> +.B ELOOP
> +.I resolve
> +contains
> +.BR RESOLVE_NO_MAGICLINKS ,
> +and one of the path components was a magic-link.
> +
> +.SH VERSIONS
> +.BR openat2 ()
> +was added to Linux in kernel 5.FOO.
> +
> +.SH CONFORMING TO
> +This system call is Linux-specific.
> +
> +The semantics of
> +.B RESOLVE_BENEATH
> +were modelled after FreeBSD's
> +.BR O_BENEATH .
> +
> +.SH NOTES
> +Glibc does not provide a wrapper for this system call; call it using
> +.BR syscall (2).
> +
> +.SS Extensibility
> +In order to allow for
> +.I struct open_how
> +to be extended in future kernel revisions,
> +.BR openat2 ()
> +requires userspace to specify the size of the
> +.I struct open_how
> +structure it is passing. By providing this information, it is possible for
> +.BR openat2 ()
> +to provide both forwards- and backwards-compatibility \(em with
> +.I size
> +acting as an implicit version number (because new extension fields will always
> +be appended, the size will always increase.) This extensibility design is very
> +similar to other system calls such as
> +.BR sched_setattr "(2), " perf_event_open "(2), and " clone3 (2).
> +
> +If we let
> +.I usize
> +be the size of the structure according to userspace and
> +.I ksize
> +be the size of the structure which the kernel supports, then there are only
> +three cases to consider:
> +
> +.RS
> +.IP * 3
> +If
> +.IR ksize " equals " usize ,
> +then there is no version mismatch and
> +.I how
> +can be used verbatim.
> +.IP *
> +If
> +.IR ksize " is larger than " usize ,
> +then there are some extensions the kernel supports which the userspace program
> +is unaware of. Because all extensions must have their zero values be a no-op,
> +the kernel treats all of the extension fields not set by userspace as having zero
> +values. This provides backwards-compatibility.
> +.IP *
> +If
> +.IR ksize " is smaller than " usize ,
> +then there are some extensions which the userspace program is aware of but the
> +kernel does not support. Because all extensions must have their zero values be
> +a no-op, the kernel can safely ignore the unsupported extension fields if they
> +are all-zero. If any unsupported extension fields are non-zero, then an error
> +is returned. This provides forwards-compatibility.
> +.RE
> +
> +Therefore, most userspace programs will not need to have any special handling
> +of extensions. However, if a userspace program wishes to determine what
> +extensions the running kernel supports, it may conduct a binary search on
> +.IR size
> +(to find the largest value which doesn't produce an error.)
> +
> +.SH SEE ALSO
> +.BR openat (2),
> +.BR path_resolution (7),
> +.BR symlink (7)
> diff --git a/man7/path_resolution.7 b/man7/path_resolution.7
> index 85dd354e9a93..3da3e5b614c8 100644
> --- a/man7/path_resolution.7
> +++ b/man7/path_resolution.7
> @@ -29,17 +29,17 @@ path_resolution \- how a pathname is resolved to a file
> Some UNIX/Linux system calls have as parameter one or more filenames.
> A filename (or pathname) is resolved as follows.
> .SS Step 1: start of the resolution process
> -If the pathname starts with the \(aq/\(aq character,
> -the starting lookup directory
> -is the root directory of the calling process.
> -(A process inherits its
> -root directory from its parent.
> -Usually this will be the root directory
> -of the file hierarchy.
> -A process may get a different root directory
> -by use of the
> +If the pathname starts with the \(aq/\(aq character, the starting lookup
> +directory is the root directory of the calling process. (A process inherits its
> +root directory from its parent. Usually this will be the root directory of the
> +file hierarchy. A process may get a different root directory by use of the
> .BR chroot (2)
> -system call.
> +system call, or may temporarily use a different root directory by using
> +.BR openat2 (2)
> +with the
> +.B RESOLVE_IN_ROOT
> +flag set.
> +.PP
> A process may get an entirely private mount namespace in case
> it\(emor one of its ancestors\(emwas started by an invocation of the
> .BR clone (2)
> @@ -48,16 +48,24 @@ system call that had the
> flag set.)
> This handles the \(aq/\(aq part of the pathname.
> .PP
> -If the pathname does not start with the \(aq/\(aq character, the
> -starting lookup directory of the resolution process is the current working
> -directory of the process.
> -(This is also inherited from the parent.
> -It can be changed by use of the
> +If the pathname does not start with the \(aq/\(aq character, the starting
> +lookup directory of the resolution process is the current working directory of
> +the process \(em or in the case of
> +.BR openat (2)-style
> +syscalls, the
> +.I dfd
> +argument (or the current working directory if
> +.B AT_FDCWD
> +is passed as the
> +.I dfd
> +argument). The current working directory is inherited from the parent, and can
> +be changed by use of the
> .BR chdir (2)
> -system call.)
> +syscall.
> .PP
> Pathnames starting with a \(aq/\(aq character are called absolute pathnames.
> Pathnames not starting with a \(aq/\(aq are called relative pathnames.
> +
> .SS Step 2: walk along the path
> Set the current lookup directory to the starting lookup directory.
> Now, for each nonfinal component of the pathname, where a component
> @@ -124,6 +132,13 @@ the kernel's pathname-resolution code
> was reworked to eliminate the use of recursion,
> so that the only limit that remains is the maximum of 40
> resolutions for the entire pathname.
> +.PP
> +The resolution of symlinks during this stage can be blocked by using
> +.BR openat2 (2),
> +with the
> +.B RESOLVE_NO_SYMLINKS
> +flag set.
> +
> .SS Step 3: find the final entry
> The lookup of the final component of the pathname goes just like
> that of all other components, as described in the previous step,
> @@ -160,7 +175,8 @@ The path resolution process will assume that these entries have
> their conventional meanings, regardless of whether they are
> actually present in the physical filesystem.
> .PP
> -One cannot walk down past the root: "/.." is the same as "/".
> +One cannot walk up past the root: "/.." is the same as "/".
> +
> .SS Mount points
> After a "mount dev path" command, the pathname "path" refers to
> the root of the filesystem hierarchy on the device "dev", and no
> @@ -169,6 +185,13 @@ longer to whatever it referred to earlier.
> One can walk out of a mounted filesystem: "path/.." refers to
> the parent directory of "path",
> outside of the filesystem hierarchy on "dev".
> +.PP
> +Mount-point crossings can be blocked by using
> +.BR openat2 (2),
> +with the
> +.B RESOLVE_NO_XDEV
> +flag set (though note that this also restricts bind-mount crossings).
> +
> .SS Trailing slashes
> If a pathname ends in a \(aq/\(aq, that forces resolution of the preceding
> component as in Step 2: it has to exist and resolve to a directory.
> --
> 2.23.0
>

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
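
For readers of the Extensibility section in the quoted patch, the three usize/ksize cases can be sketched in plain userspace C. This is only an illustration under stated assumptions: copy_how() is a hypothetical helper invented here, not a kernel function, and the byte-wise test for "all-zero" trailing fields is an assumption about how that check might be done. The -E2BIG return value mirrors the E2BIG entry in the quoted ERRORS section.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * copy_how() is a hypothetical helper for illustration only.
 *
 * dst/ksize: the structure as "the kernel" understands it.
 * src/usize: the structure as passed by "userspace".
 *
 * Returns 0 on success, or -E2BIG if userspace set a trailing field
 * that lies beyond the ksize bytes known to the receiving side.
 */
int copy_how(void *dst, size_t ksize, const void *src, size_t usize)
{
	size_t common = usize < ksize ? usize : ksize;
	const unsigned char *tail = (const unsigned char *)src + common;

	assert(dst != NULL && src != NULL);

	/* ksize < usize: unknown trailing fields must all be zero. */
	for (size_t i = 0; i < usize - common; i++) {
		if (tail[i] != 0)
			return -E2BIG;
	}

	/* ksize > usize: fields the caller did not supply read as zero. */
	memset(dst, 0, ksize);

	/* ksize == usize falls out naturally: copy everything verbatim. */
	memcpy(dst, src, common);
	return 0;
}
```

With an 8-byte destination, a 4-byte source is zero-extended, a 12-byte source with zeroed trailing bytes is accepted, and a 12-byte source with any non-zero trailing byte is rejected before the destination is touched.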