From: Christian Brauner <christian@brauner.io>
To: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: viro@zeniv.linux.org.uk, linux-kernel@vger.kernel.org,
linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org,
torvalds@linux-foundation.org, fweimer@redhat.com,
jannh@google.com, oleg@redhat.com, tglx@linutronix.de,
arnd@arndb.de, shuah@kernel.org, dhowells@redhat.com,
tkjos@android.com, ldv@altlinux.org, miklos@szeredi.hu,
linux-alpha@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, linux-ia64@vger.kernel.org,
linux-m68k@lists.linux-m68k.org, linux-mips@vger.kernel.org,
linux-parisc@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
linux-s390@vger.kernel.org, linux-sh@vger.kernel.org,
sparclinux@vger.kernel.org, linux-xtensa@linux-xtensa.org,
linux-arch@vger.kernel.org, linux-kselftest@vger.kernel.org,
x86@kernel.org
Subject: Re: [PATCH v1 1/2] open: add close_range()
Date: Fri, 24 May 2019 12:31:39 +0200 [thread overview]
Message-ID: <20190524103137.otw7tbmagiqnf6u6@brauner.io> (raw)
In-Reply-To: <67e4458a-9cc4-d1aa-608c-73ebe9e2f7a3@yandex-team.ru>
On Thu, May 23, 2019 at 07:22:17PM +0300, Konstantin Khlebnikov wrote:
> On 22.05.2019 18:52, Christian Brauner wrote:> This adds the close_range() syscall. It allows to efficiently close a range
> > of file descriptors up to all file descriptors of a calling task.
> >
> > The syscall came up in a recent discussion around the new mount API and
> > making new file descriptor types cloexec by default. During this
> > discussion, Al suggested the close_range() syscall (cf. [1]). Note, a
> > syscall in this manner has been requested by various people over time.
> >
> > First, it helps to close all file descriptors of an exec()ing task. This
> > can be done safely via (quoting Al's example from [1] verbatim):
> >
> > /* that exec is sensitive */
> > unshare(CLONE_FILES);
> > /* we don't want anything past stderr here */
> > close_range(3, ~0U);
> > execve(....);
> >
> > The code snippet above is one way of working around the problem that file
> > descriptors are not cloexec by default. This is aggravated by the fact that
> > we can't just switch them over without massively regressing userspace. For
> > a whole class of programs having an in-kernel method of closing all file
> > descriptors is very helpful (e.g. demons, service managers, programming
> > language standard libraries, container managers etc.).
> > (Please note, unshare(CLONE_FILES) should only be needed if the calling
> > task is multi-threaded and shares the file descriptor table with another
> > thread in which case two threads could race with one thread allocating
> > file descriptors and the other one closing them via close_range(). For the
> > general case close_range() before the execve() is sufficient.)
> >
> > Second, it allows userspace to avoid implementing closing all file
> > descriptors by parsing through /proc/<pid>/fd/* and calling close() on each
> > file descriptor. From looking at various large(ish) userspace code bases
> > this or similar patterns are very common in:
> > - service managers (cf. [4])
> > - libcs (cf. [6])
> > - container runtimes (cf. [5])
> > - programming language runtimes/standard libraries
> > - Python (cf. [2])
> > - Rust (cf. [7], [8])
> > As Dmitry pointed out there's even a long-standing glibc bug about missing
> > kernel support for this task (cf. [3]).
> > In addition, the syscall will also work for tasks that do not have procfs
> > mounted and on kernels that do not have procfs support compiled in. In such
> > situations the only way to make sure that all file descriptors are closed
> > is to call close() on each file descriptor up to UINT_MAX or RLIMIT_NOFILE,
> > OPEN_MAX trickery (cf. comment [8] on Rust).
> >
> > The performance is striking. For good measure, comparing the following
> > simple close_all_fds() userspace implementation that is essentially just
> > glibc's version in [6]:
> >
> > static int close_all_fds(void)
> > {
> > int dir_fd;
> > DIR *dir;
> > struct dirent *direntp;
> >
> > dir = opendir("/proc/self/fd");
> > if (!dir)
> > return -1;
> > dir_fd = dirfd(dir);
> > while ((direntp = readdir(dir))) {
> > int fd;
> > if (strcmp(direntp->d_name, ".") == 0)
> > continue;
> > if (strcmp(direntp->d_name, "..") == 0)
> > continue;
> > fd = atoi(direntp->d_name);
> > if (fd == dir_fd || fd == 0 || fd == 1 || fd == 2)
> > continue;
> > close(fd);
> > }
> > closedir(dir);
> > return 0;
> > }
> >
> > to close_range() yields:
> > 1. closing 4 open files:
> > - close_all_fds(): ~280 us
> > - close_range(): ~24 us
> >
> > 2. closing 1000 open files:
> > - close_all_fds(): ~5000 us
> > - close_range(): ~800 us
> >
> > close_range() is designed to allow for some flexibility. Specifically, it
> > does not simply always close all open file descriptors of a task. Instead,
> > callers can specify an upper bound.
> > This is e.g. useful for scenarios where specific file descriptors are
> > created with well-known numbers that are supposed to be excluded from
> > getting closed.
> > For extra paranoia close_range() comes with a flags argument. This can e.g.
> > be used to implement extension. Once can imagine userspace wanting to stop
> > at the first error instead of ignoring errors under certain circumstances.
>
> > There might be other valid ideas in the future. In any case, a flag
> > argument doesn't hurt and keeps us on the safe side.
>
> Here is another strange but real-live scenario: crash handler for dumping core.
>
> If applications has network connections it would be better to close them all,
> otherwise clients will wait until end of dumping process or timeout.
> Also closing normal files might be a good idea for releasing locks.
>
> But simple closing might race with other threads - closed fd will be reused
> while some code still thinks it refers to original file.
>
> Our solution closes files without freeing fd: it opens /dev/null and
> replaces all opened descriptors using dup2.
>
> So, special flag for close_range() could close files without clearing bitmap.
> Effect should be the same - fd wouldn't be reused.
>
> Actually two flags for two phases: closing files and releasing fd.
Konstantin, I'm sorry, I totally missed that part of your mail
yesterday.
Without speaking to the feasibility of this it's at least a good
illustration that people really do have the possible need for a flag
argument.
Thanks!
Christian
prev parent reply other threads:[~2019-05-24 10:31 UTC|newest]
Thread overview: 15+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-05-22 15:52 [PATCH v1 1/2] open: add close_range() Christian Brauner
2019-05-22 15:52 ` [PATCH v1 2/2] tests: add close_range() tests Christian Brauner
2019-05-22 16:57 ` [PATCH v1 1/2] open: add close_range() Oleg Nesterov
2019-05-23 11:51 ` Christian Brauner
2019-05-23 14:14 ` Christian Brauner
2019-05-23 14:28 ` Christian Brauner
2019-05-23 16:21 ` Oleg Nesterov
2019-05-23 14:32 ` Jann Horn
2019-05-23 14:37 ` Christian Brauner
2019-05-23 16:22 ` Konstantin Khlebnikov
2019-05-23 16:29 ` David Laight
2019-05-23 16:33 ` Christian Brauner
2019-05-24 7:43 ` Arnd Bergmann
2019-05-24 9:57 ` Christian Brauner
2019-05-24 10:31 ` Christian Brauner [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20190524103137.otw7tbmagiqnf6u6@brauner.io \
--to=christian@brauner.io \
--cc=arnd@arndb.de \
--cc=dhowells@redhat.com \
--cc=fweimer@redhat.com \
--cc=jannh@google.com \
--cc=khlebnikov@yandex-team.ru \
--cc=ldv@altlinux.org \
--cc=linux-alpha@vger.kernel.org \
--cc=linux-api@vger.kernel.org \
--cc=linux-arch@vger.kernel.org \
--cc=linux-arm-kernel@lists.infradead.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-ia64@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-kselftest@vger.kernel.org \
--cc=linux-m68k@lists.linux-m68k.org \
--cc=linux-mips@vger.kernel.org \
--cc=linux-parisc@vger.kernel.org \
--cc=linux-s390@vger.kernel.org \
--cc=linux-sh@vger.kernel.org \
--cc=linux-xtensa@linux-xtensa.org \
--cc=linuxppc-dev@lists.ozlabs.org \
--cc=miklos@szeredi.hu \
--cc=oleg@redhat.com \
--cc=shuah@kernel.org \
--cc=sparclinux@vger.kernel.org \
--cc=tglx@linutronix.de \
--cc=tkjos@android.com \
--cc=torvalds@linux-foundation.org \
--cc=viro@zeniv.linux.org.uk \
--cc=x86@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).