* [PATCH v6] close_range.2: new page documenting close_range(2)
@ 2021-01-23 16:11 Stephen Kitt
  2021-01-28 20:50 ` Michael Kerrisk (man-pages)
  0 siblings, 1 reply; 8+ messages in thread
From: Stephen Kitt @ 2021-01-23 16:11 UTC (permalink / raw)
  To: linux-man, Alejandro Colomar, Michael Kerrisk
  Cc: Christian Brauner, Giuseppe Scrivano, linux-kernel, Stephen Kitt

This documents close_range(2) based on information in
278a5fbaed89dacd04e9d052f4594ffd0e0585de,
60997c3d45d9a67daf01c56d805ae4fec37e0bd8, and
582f1fb6b721facf04848d2ca57f34468da1813e.

Signed-off-by: Stephen Kitt <steve@sk2.org>
---
V6: bit mask, close-on-exec flag language improvements
    another close(2) reference
    only include one example program
    ensure the example code doesn't wrap

V5: clarification of the open/close_range/execve sequence

V4: sort flags alphabetically
    move commit references inside the corresponding section
    more semantic newlines
    unformat numeric constants
    more formatting for function references
    escape C backslashes
    C99 loop indices

V3: fix synopsis overflow
    copy notes from membarrier.2 re the lack of wrapper
    semantic newlines
    drop non-standard "USE CASES" section heading
    add code example

V2: unsigned int to match the kernel declarations
    groff and grammar tweaks
    CLOSE_RANGE_UNSHARE unshares *and* closes
    Explain that EMFILE and ENOMEM can occur with C_R_U
    "Conforming to" phrasing
    Detailed explanation of CLOSE_RANGE_UNSHARE
    Reading /proc isn't common

 man2/close_range.2 | 236 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 236 insertions(+)
 create mode 100644 man2/close_range.2

diff --git a/man2/close_range.2 b/man2/close_range.2
new file mode 100644
index 000000000..5abb73990
--- /dev/null
+++ b/man2/close_range.2
@@ -0,0 +1,236 @@
+.\" Copyright (c) 2020 Stephen Kitt <steve@sk2.org>
+.\"
+.\" %%%LICENSE_START(VERBATIM)
+.\" Permission is granted to make and distribute verbatim copies of this
+.\" manual provided the copyright notice and this permission notice are
+.\" preserved on all copies.
+.\"
+.\" Permission is granted to copy and distribute modified versions of this
+.\" manual under the conditions for verbatim copying, provided that the
+.\" entire resulting derived work is distributed under the terms of a
+.\" permission notice identical to this one.
+.\"
+.\" Since the Linux kernel and libraries are constantly changing, this
+.\" manual page may be incorrect or out-of-date.  The author(s) assume no
+.\" responsibility for errors or omissions, or for damages resulting from
+.\" the use of the information contained herein.  The author(s) may not
+.\" have taken the same level of care in the production of this manual,
+.\" which is licensed free of charge, as they might when working
+.\" professionally.
+.\"
+.\" Formatted or processed versions of this manual, if unaccompanied by
+.\" the source, must acknowledge the copyright and authors of this work.
+.\" %%%LICENSE_END
+.\"
+.TH CLOSE_RANGE 2 2020-12-08 "Linux" "Linux Programmer's Manual"
+.SH NAME
+close_range \- close all file descriptors in a given range
+.SH SYNOPSIS
+.nf
+.B #include <linux/close_range.h>
+.PP
+.BI "int close_range(unsigned int " first ", unsigned int " last ,
+.BI "                unsigned int " flags );
+.fi
+.PP
+.IR Note :
+There is no glibc wrapper for this system call; see NOTES.
+.SH DESCRIPTION
+The
+.BR close_range ()
+system call closes all open file descriptors from
+.I first
+to
+.I last
+(inclusive).
+.PP
+Errors closing a given file descriptor are currently ignored.
+.PP
+.I flags
+is a bit mask containing 0 or more of the following:
+.TP
+.BR CLOSE_RANGE_CLOEXEC " (since Linux 5.11)"
+Set the close-on-exec flag on the specified file descriptors,
+rather than immediately closing them.
+.TP
+.B CLOSE_RANGE_UNSHARE
+Unshare the specified file descriptors from any other processes
+before closing them,
+avoiding races with other threads sharing the file descriptor table.
+.SH RETURN VALUE
+On success,
+.BR close_range ()
+returns 0.
+On error, \-1 is returned and
+.I errno
+is set to indicate the cause of the error.
+.SH ERRORS
+.TP
+.B EINVAL
+.I flags
+is not valid, or
+.I first
+is greater than
+.IR last .
+.PP
+The following can occur with
+.B CLOSE_RANGE_UNSHARE
+(when constructing the new descriptor table):
+.TP
+.B EMFILE
+The per-process limit on the number of open file descriptors has been reached
+(see the description of
+.B RLIMIT_NOFILE
+in
+.BR getrlimit (2)).
+.TP
+.B ENOMEM
+Insufficient kernel memory was available.
+.SH VERSIONS
+.BR close_range ()
+first appeared in Linux 5.9.
+.SH CONFORMING TO
+.BR close_range ()
+is a nonstandard function that is also present on FreeBSD.
+.SH NOTES
+Glibc does not provide a wrapper for this system call; call it using
+.BR syscall (2).
+.SS Closing all open file descriptors
+.\" 278a5fbaed89dacd04e9d052f4594ffd0e0585de
+To avoid blindly closing file descriptors
+in the range of possible file descriptors,
+this is sometimes implemented (on Linux)
+by listing open file descriptors in
+.I /proc/self/fd/
+and calling
+.BR close (2)
+on each one.
+.BR close_range ()
+can take care of this without requiring
+.I /proc
+and within a single system call,
+which provides significant performance benefits.
+.SS Closing file descriptors before exec
+.\" 60997c3d45d9a67daf01c56d805ae4fec37e0bd8
+File descriptors can be closed safely using
+.PP
+.in +4n
+.EX
+/* we don't want anything past stderr here */
+close_range(3, ~0U, CLOSE_RANGE_UNSHARE);
+execve(....);
+.EE
+.in
+.PP
+.B CLOSE_RANGE_UNSHARE
+is conceptually equivalent to
+.PP
+.in +4n
+.EX
+unshare(CLONE_FILES);
+close_range(first, last, 0);
+.EE
+.in
+.PP
+but can be more efficient:
+if the unshared range extends past
+the current maximum number of file descriptors allocated
+in the caller's file descriptor table
+(the common case when
+.I last
+is ~0U),
+the kernel will unshare a new file descriptor table for the caller up to
+.IR first .
+This avoids subsequent
+.BR close (2)
+calls entirely;
+the whole operation is complete once the table is unshared.
+.SS Closing files on \fBexec\fP
+.\" 582f1fb6b721facf04848d2ca57f34468da1813e
+This is particularly useful in cases where multiple
+.RB pre- exec
+setup steps risk conflicting with each other.
+For example, setting up a
+.BR seccomp (2)
+profile can conflict with a
+.BR close_range ()
+call:
+if the file descriptors are closed before the
+.BR seccomp (2)
+profile is set up,
+the profile setup can't use them itself,
+or control their closure;
+if the file descriptors are closed afterwards,
+the seccomp profile can't block the
+.BR close_range ()
+call or any fallbacks.
+Using
+.B CLOSE_RANGE_CLOEXEC
+avoids this:
+the descriptors can be marked before the
+.BR seccomp (2)
+profile is set up,
+and the profile can control access to
+.BR close_range ()
+without affecting the calling process.
+.SH EXAMPLES
+The following program executes the command given on its command line,
+after opening the files listed after the command and then using
+.BR close_range ()
+to close them:
+.PP
+.in +4n
+.EX
+/* close_range.c */
+
+#include <fcntl.h>
+#include <linux/close_range.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+int
+main(int argc, char *argv[])
+{
+    char *newargv[] = { argv[1], NULL };
+    char *newenviron[] = { NULL };
+
+    if (argc < 3) {
+        fprintf(stderr, "Usage: %s <command> <file>...\en", argv[0]);
+        exit(EXIT_FAILURE);
+    }
+
+    for (int i = 2; i < argc; i++) {
+        if (open(argv[i], O_RDONLY) == -1) {
+            perror(argv[i]);
+            exit(EXIT_FAILURE);
+        }
+    }
+
+    if (syscall(__NR_close_range, 3, ~0U, 0) == -1) {
+        perror("close_range");
+        exit(EXIT_FAILURE);
+    }
+
+    execve(argv[1], newargv, newenviron);
+    perror("execve");
+    exit(EXIT_FAILURE);
+}
+.EE
+.in
+.PP
+Running any program with the above, with files to open:
+.PP
+.in +4n
+.EX
+.RB "$" " ./close_range " <program> " /dev/null /dev/zero"
+.EE
+.in
+.PP
+and inspecting the open files in the resulting process will show that
+the files have indeed been closed.
+.SH SEE ALSO
+.BR close (2)

base-commit: fb0d03d11cec04da7720a80a1373605d81fbb432
-- 
2.20.1


* Re: [PATCH v6] close_range.2: new page documenting close_range(2)
  2021-01-23 16:11 [PATCH v6] close_range.2: new page documenting close_range(2) Stephen Kitt
@ 2021-01-28 20:50 ` Michael Kerrisk (man-pages)
  2021-01-28 22:10   ` Stephen Kitt
                     ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Michael Kerrisk (man-pages) @ 2021-01-28 20:50 UTC (permalink / raw)
  To: Stephen Kitt, linux-man, Alejandro Colomar
  Cc: mtk.manpages, Christian Brauner, Giuseppe Scrivano, linux-kernel

Hello Stephen, (and Christian, please!)


Thanks for your patch revision. I've merged it, and have
done some light editing, but I still have a question:

On 1/23/21 5:11 PM, Stephen Kitt wrote:

[...]

> +.SH ERRORS

> +.TP
> +.B EMFILE
> +The per-process limit on the number of open file descriptors has been reached
> +(see the description of
> +.B RLIMIT_NOFILE
> +in
> +.BR getrlimit (2)).

I think there was already a question about this error, but
I still have a doubt.

A glance at the code tells me that indeed EMFILE can occur.
But how can the reason be because the limit on the number
of open file descriptors has been reached? I mean: no new
FDs are being opened, so how can we go over the limit. I think
the cause of this error is something else, but what is it?

Thanks,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

* Re: [PATCH v6] close_range.2: new page documenting close_range(2)
  2021-01-28 20:50 ` Michael Kerrisk (man-pages)
@ 2021-01-28 22:10   ` Stephen Kitt
       [not found]     ` <20210129100024.m4bil5mz5prry4iq@wittgenstein>
  2021-01-29 10:01   ` Christian Brauner
  2021-03-09 19:53   ` Stephen Kitt
  2 siblings, 1 reply; 8+ messages in thread
From: Stephen Kitt @ 2021-01-28 22:10 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: linux-man, Alejandro Colomar, Christian Brauner,
	Giuseppe Scrivano, linux-kernel

Hello Michael,

On Thu, 28 Jan 2021 21:50:23 +0100, "Michael Kerrisk (man-pages)"
<mtk.manpages@gmail.com> wrote:
> Thanks for your patch revision. I've merged it, and have
> done some light editing, but I still have a question:
> 
> On 1/23/21 5:11 PM, Stephen Kitt wrote:
> 
> [...]
> 
> > +.SH ERRORS  
> 
> > +.TP
> > +.B EMFILE
> > +The per-process limit on the number of open file descriptors has been reached
> > +(see the description of
> > +.B RLIMIT_NOFILE
> > +in
> > +.BR getrlimit (2)).  
> 
> I think there was already a question about this error, but
> I still have a doubt.
> 
> A glance at the code tells me that indeed EMFILE can occur.
> But how can the reason be because the limit on the number
> of open file descriptors has been reached? I mean: no new
> FDs are being opened, so how can we go over the limit. I think
> the cause of this error is something else, but what is it?

Here’s how I understand the code that can lead to EMFILE:

* in __close_range(), if CLOSE_RANGE_UNSHARE is set, call unshare_fd() with
  CLONE_FILES to clone the fd table
* unshare_fd() calls dup_fd()
* dup_fd() allocates a new fdtable, and if the resulting fdtable ends up
  being too small to hold the number of fds calculated by
  sane_fdtable_size(), fails with EMFILE

I suspect that, given that we’re starting with a valid fdtable, the only way
this can happen is if there’s a race with sysctl_nr_open being reduced.
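
The failure modes listed above can be sketched from the caller's side as
follows. This is only an illustration, assuming Linux 5.9 or later and
headers that define SYS_close_range; the helper names
close_range_unshared() and explain_close_range_errno() are hypothetical
and not part of any library, and the fallback value for
CLOSE_RANGE_UNSHARE is the one from <linux/close_range.h>:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef CLOSE_RANGE_UNSHARE
#define CLOSE_RANGE_UNSHARE (1U << 1)  /* as in <linux/close_range.h> */
#endif

/* Hypothetical helper: unshare the fd table, then close first..last. */
int
close_range_unshared(unsigned int first, unsigned int last)
{
    /* first > last is a caller bug; the kernel rejects it with EINVAL. */
    assert(first <= last);
#ifdef SYS_close_range
    return (int) syscall(SYS_close_range, first, last,
                         CLOSE_RANGE_UNSHARE);
#else
    errno = ENOSYS;  /* headers too old to know the syscall number */
    return -1;
#endif
}

/* Hypothetical helper: map the errno values discussed in this thread
   to a short explanation. */
const char *
explain_close_range_errno(int err)
{
    switch (err) {
    case EMFILE:
        return "fd table does not fit (fs.nr_open lowered?)";
    case ENOMEM:
        return "no kernel memory for the new fd table";
    case EINVAL:
        return "bad flags, or first > last";
    case ENOSYS:
        return "close_range() not available";
    default:
        return "unexpected error";
    }
}
```

A caller would typically invoke close_range_unshared(3, ~0U) just before
execve() and, on a return value of -1, report
explain_close_range_errno(errno). EMFILE and ENOMEM can only come from
the unshare step, so a fallback without CLOSE_RANGE_UNSHARE remains
possible when sharing the table is acceptable.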

Incidentally, isn’t this comment in file.c somewhat misleading?

		/*
		 * If the requested range is greater than the current maximum,
		 * we're closing everything so only copy all file descriptors
		 * beneath the lowest file descriptor.
		 */

As I understand it, dup_fd() will always copy any open file descriptor
anyway, it won’t stop at max_unshare_fds if that’s lower than the number of
open fds (thanks to sane_fdtable_size())...

Regards,

Stephen

* Re: [PATCH v6] close_range.2: new page documenting close_range(2)
  2021-01-28 20:50 ` Michael Kerrisk (man-pages)
  2021-01-28 22:10   ` Stephen Kitt
@ 2021-01-29 10:01   ` Christian Brauner
  2021-03-09 19:53   ` Stephen Kitt
  2 siblings, 0 replies; 8+ messages in thread
From: Christian Brauner @ 2021-01-29 10:01 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Stephen Kitt, linux-man, Alejandro Colomar, Giuseppe Scrivano,
	linux-kernel

On Thu, Jan 28, 2021 at 09:50:23PM +0100, Michael Kerrisk (man-pages) wrote:
> Hello Stephen, (and Christian, please!)

Ah, I think this was mostly done which is why I kept quiet.

Christian

* Re: [PATCH v6] close_range.2: new page documenting close_range(2)
  2021-01-28 20:50 ` Michael Kerrisk (man-pages)
  2021-01-28 22:10   ` Stephen Kitt
  2021-01-29 10:01   ` Christian Brauner
@ 2021-03-09 19:53   ` Stephen Kitt
  2021-03-21 15:38     ` Michael Kerrisk (man-pages)
  2 siblings, 1 reply; 8+ messages in thread
From: Stephen Kitt @ 2021-03-09 19:53 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: linux-man, Alejandro Colomar, Christian Brauner,
	Giuseppe Scrivano, linux-kernel

Hi Michael,

On Thu, 28 Jan 2021 21:50:23 +0100, "Michael Kerrisk (man-pages)"
<mtk.manpages@gmail.com> wrote:
> Thanks for your patch revision. I've merged it, and have
> done some light editing, but I still have a question:

Does this need anything more? I don’t see it in the man-pages repo.

Regards,

Stephen

* Re: [PATCH v6] close_range.2: new page documenting close_range(2)
       [not found]     ` <20210129100024.m4bil5mz5prry4iq@wittgenstein>
@ 2021-03-21 15:31       ` Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 8+ messages in thread
From: Michael Kerrisk (man-pages) @ 2021-03-21 15:31 UTC (permalink / raw)
  To: Christian Brauner, Stephen Kitt
  Cc: mtk.manpages, linux-man, Alejandro Colomar, Giuseppe Scrivano,
	linux-kernel

Hello Stephen and Christian,

Late follow-up, I'm afraid...

On 1/29/21 11:00 AM, Christian Brauner wrote:
> On Thu, Jan 28, 2021 at 11:10:40PM +0100, Stephen Kitt wrote:
>> Hello Michael,
>>
>> On Thu, 28 Jan 2021 21:50:23 +0100, "Michael Kerrisk (man-pages)"
>> <mtk.manpages@gmail.com> wrote:
>>> Thanks for your patch revision. I've merged it, and have
>>> done some light editing, but I still have a question:
>>>
>>> On 1/23/21 5:11 PM, Stephen Kitt wrote:
>>>
>>> [...]
>>>
>>>> +.SH ERRORS  
>>>
>>>> +.TP
>>>> +.B EMFILE
>>>> +The per-process limit on the number of open file descriptors has been reached
>>>> +(see the description of
>>>> +.B RLIMIT_NOFILE
>>>> +in
>>>> +.BR getrlimit (2)).  
>>>
>>> I think there was already a question about this error, but
>>> I still have a doubt.
>>>
>>> A glance at the code tells me that indeed EMFILE can occur.
>>> But how can the reason be because the limit on the number
>>> of open file descriptors has been reached? I mean: no new
>>> FDs are being opened, so how can we go over the limit. I think
>>> the cause of this error is something else, but what is it?
>>
>> Here’s how I understand the code that can lead to EMFILE:
>>
>> * in __close_range(), if CLOSE_RANGE_UNSHARE is set, call unshare_fd() with
>>   CLONE_FILES to clone the fd table
>> * unshare_fd() calls dup_fd()
>> * dup_fd() allocates a new fdtable, and if the resulting fdtable ends up
>>   being too small to hold the number of fds calculated by
>>   sane_fdtable_size(), fails with EMFILE
>>
>> I suspect that, given that we’re starting with a valid fdtable, the only way
>> this can happen is if there’s a race with sysctl_nr_open being reduced.
> 
> Yes, and sysctls are racy by nature.

Got it, I think. I changed the error text here to:

       EMFILE The number of open file descriptors exceeds the limit spec‐
              ified in /proc/sys/fs/nr_open (see  proc(5)).   This  error
              can occur in situations where that limit was lowered before
              a call to close_range() where the CLOSE_RANGE_UNSHARE  flag
              is specified.
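
For readers who want to inspect the limit named in that text, a minimal
sketch follows. It assumes procfs is mounted at /proc; the function name
read_limit_file() is hypothetical:

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: read one decimal value from a procfs-style
   file such as /proc/sys/fs/nr_open. Returns -1 on any failure. */
long
read_limit_file(const char *path)
{
    FILE *fp;
    long value;

    assert(path != NULL);

    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;

    /* The file holds a single integer followed by a newline. */
    if (fscanf(fp, "%ld", &value) != 1)
        value = -1;

    fclose(fp);
    return value;
}
```

Calling read_limit_file("/proc/sys/fs/nr_open") before close_range()
gives at most an advisory answer: as noted earlier in the thread, the
sysctl can change between the read and the system call, so the EMFILE
return still has to be handled.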

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

* Re: [PATCH v6] close_range.2: new page documenting close_range(2)
  2021-03-09 19:53   ` Stephen Kitt
@ 2021-03-21 15:38     ` Michael Kerrisk (man-pages)
  2021-03-22 21:31       ` Stephen Kitt
  0 siblings, 1 reply; 8+ messages in thread
From: Michael Kerrisk (man-pages) @ 2021-03-21 15:38 UTC (permalink / raw)
  To: Stephen Kitt
  Cc: mtk.manpages, linux-man, Alejandro Colomar, Christian Brauner,
	Giuseppe Scrivano, linux-kernel

On 3/9/21 8:53 PM, Stephen Kitt wrote:
> Hi Michael,
> 
> On Thu, 28 Jan 2021 21:50:23 +0100, "Michael Kerrisk (man-pages)"
> <mtk.manpages@gmail.com> wrote:
>> Thanks for your patch revision. I've merged it, and have
>> done some light editing, but I still have a question:
> 
> Does this need anything more? I don’t see it in the man-pages repo.

Sorry, Stephen. It's just me being slow. I've made a few edits,
replaced the example program with another that more clearly allows
the user to see what's going on, and pushed to Git.

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

* Re: [PATCH v6] close_range.2: new page documenting close_range(2)
  2021-03-21 15:38     ` Michael Kerrisk (man-pages)
@ 2021-03-22 21:31       ` Stephen Kitt
  0 siblings, 0 replies; 8+ messages in thread
From: Stephen Kitt @ 2021-03-22 21:31 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: linux-man, Alejandro Colomar, Christian Brauner,
	Giuseppe Scrivano, linux-kernel

On Sun, 21 Mar 2021 16:38:59 +0100, "Michael Kerrisk (man-pages)"
<mtk.manpages@gmail.com> wrote:
> On 3/9/21 8:53 PM, Stephen Kitt wrote:
> > On Thu, 28 Jan 2021 21:50:23 +0100, "Michael Kerrisk (man-pages)"
> > <mtk.manpages@gmail.com> wrote:  
> >> Thanks for your patch revision. I've merged it, and have
> >> done some light editing, but I still have a question:  
> > 
> > Does this need anything more? I don’t see it in the man-pages repo.  
> 
> Sorry, Stephen. It's just me being slow. I've made a few edits,
> replaced the example program with another that more clearly allows
> the user to see what's going on, and pushed to Git.

Thanks, your example program is indeed much better!

Regards,

Stephen
