Re: [patch] close_range.2: new page documenting close_range(2)

From: Christian Brauner <christian.brauner@ubuntu.com>
To: "Alejandro Colomar (man-pages)" <alx.manpages@gmail.com>
Cc: Stephen Kitt <steve@sk2.org>,
	linux-man@vger.kernel.org,
	Michael Kerrisk <mtk.manpages@gmail.com>,
	linux-kernel@vger.kernel.org
Subject: Re: [patch] close_range.2: new page documenting close_range(2)
Date: Sat, 12 Dec 2020 13:14:19 +0100
Message-ID: <20201212121419.odpgbaigrjhpkjnm@wittgenstein>
In-Reply-To: <0ea38a7a-1c64-086e-3d64-38686f5b7856@gmail.com>

On Thu, Dec 10, 2020 at 03:36:42PM +0100, Alejandro Colomar (man-pages) wrote:
> Hi Christian,

Hi Alex,

> 
> Thanks for confirming that behavior.  Seems reasonable.
> 
> I was wondering...
> If this call is equivalent to unshare(2)+{close(2) in a loop},
> shouldn't it fail for the same reasons those syscalls can fail?
> 
> What about the following errors?:
> 
> From unshare(2):
> 
>        EPERM  The calling process did not have the required privileges
>               for this operation.

unshare(CLONE_FILES) doesn't require any privileges. Only flags relevant
to kernel/nsproxy.c:unshare_nsproxy_namespaces() require privileges,
i.e.
CLONE_NEWNS
CLONE_NEWUTS
CLONE_NEWIPC
CLONE_NEWNET
CLONE_NEWPID
CLONE_NEWCGROUP
CLONE_NEWTIME
so close_range() with CLOSE_RANGE_UNSHARE has the same permission
requirements as unshare(CLONE_FILES), i.e. none.

> 
> From close(2):
>        EBADF  fd isn't a valid open file descriptor.
> 
> OK, this one can't happen with the current code.
> Let's say there are fds 1 to 10, and you call 'close_range(20,30,0)'.
> It's a no-op (although it will still unshare if the flag is set).
> But shouldn't it fail with EBADF?

CLOSE_RANGE_UNSHARE should always give you a private file descriptor
table independent of whether or not any file descriptors need to be
closed. That's also how we documented the flag:

/* Unshare the file descriptor table before closing file descriptors. */
#define CLOSE_RANGE_UNSHARE	(1U << 1)

A caller calling unshare(CLONE_FILES) and then an emulated close_range()
or the proper close_range() syscall wants to make sure that all unwanted
file descriptors are closed (if any) and that no new file descriptors
can be injected afterwards. If you skip the unshare(CLONE_FILES) because
there are no fds to be closed, you open up a race window. It would also
be annoying for userspace if callers _may_ have received a private file
descriptor table, but only if any fds needed to be closed.

If people were really keen on skipping the unshare when no fd needs to
be closed, then this could become a new flag. But I really don't think
that's necessary, and it also doesn't make a lot of sense, imho.
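
As a rough illustration of the "unshare(CLONE_FILES) and then an emulated
close_range()" pattern described above, here is a minimal userspace sketch.
It is not taken from the kernel sources. The names emulated_close_range()
and EMU_UNSHARE are hypothetical, and capping the loop at
sysconf(_SC_OPEN_MAX) is a simplification (descriptors above a lowered
limit would be missed). The point it shows is the ordering: the unshare
happens unconditionally, before checking whether any fd in the range is
open.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <unistd.h>

/* Hypothetical flag for this sketch; it mirrors the meaning of
 * CLOSE_RANGE_UNSHARE but is not a kernel definition. */
#define EMU_UNSHARE (1U << 1)

/*
 * Userspace emulation sketch.
 * Returns 0 on success, or -1 with errno set if the arguments are
 * invalid or the unshare step fails. Errors from the individual
 * close() calls are deliberately ignored.
 */
static int emulated_close_range(unsigned int first, unsigned int last,
                                unsigned int flags)
{
    if (first > last || (flags & ~EMU_UNSHARE)) {
        errno = EINVAL;
        return -1;
    }

    /* Always unshare when asked to, even if the range turns out to
     * contain no open descriptors. Skipping this would leave a window
     * in which another thread sharing the table could inject fds. */
    if ((flags & EMU_UNSHARE) && unshare(CLONE_FILES) == -1)
        return -1;

    /* Simplification: assume no fd lives at or above the current limit. */
    long max = sysconf(_SC_OPEN_MAX);
    unsigned long long end = last;
    if (max > 0 && end >= (unsigned long long)max)
        end = (unsigned long long)max - 1;

    for (unsigned long long fd = first; fd <= end; fd++)
        (void)close((int)fd); /* EBADF and friends are ignored */

    return 0;
}
```

A caller that wants a private table with everything above stderr closed
would call emulated_close_range(3, ~0U, EMU_UNSHARE) and, on success,
can rely on the table being private whether or not anything was closed.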

> 
>        EINTR  The close() call was interrupted by a signal; see
>               signal(7).
> 
>        EIO    An I/O error occurred.
> 
>        ENOSPC, EDQUOT
>               On NFS, these errors are not normally reported against
>               the first write which exceeds the available storage
>               space, but instead against a subsequent write(2),
>               fsync(2), or close().

None of these will be seen by userspace because close_range() currently
ignores all errors after it has begun closing files.
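
To make the consequence for callers concrete, here is a small sketch that
invokes the system call directly through syscall(2). The wrapper name
close_range_raw() is hypothetical, and the fallback value 436 for
SYS_close_range is an assumption that holds on x86-64 and most other
architectures but not all. With the behaviour described above, the only
failures a caller should expect are the up-front ones, such as EINVAL for
first > last or invalid flags, or ENOSYS on kernels older than 5.9.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SYS_close_range
/* Assumption: syscall number on x86-64 and most other architectures. */
#define SYS_close_range 436
#endif

/*
 * Thin wrapper around the raw system call (hypothetical name).
 * Returns 0 on success, -1 with errno set on failure.
 */
static int close_range_raw(unsigned int first, unsigned int last,
                           unsigned int flags)
{
    return (int)syscall(SYS_close_range, first, last, flags);
}

/*
 * Close every descriptor in [first, last] and report only whether the
 * call itself was accepted. Per-descriptor close errors (EINTR, EIO,
 * ENOSPC, EDQUOT) are not reported by the kernel, so there is nothing
 * more to check here.
 */
static int close_fds(unsigned int first, unsigned int last)
{
    if (close_range_raw(first, last, 0) == 0)
        return 0;
    /* errno is EINVAL for a bad range or flags, or ENOSYS if the
     * running kernel does not implement close_range(). */
    return -1;
}
```

Calling close_fds(20, 30) in a process whose open descriptors are all
below 20 returns 0 rather than failing with EBADF, which matches the
no-op case discussed earlier in the thread.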

Christian