Re: For review: seccomp_user_notif(2) manual page [v2] - Michael Kerrisk (man-pages)

From: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
To: Sargun Dhillon <sargun@sargun.me>
Cc: mtk.manpages@gmail.com, Tycho Andersen <tycho@tycho.pizza>,
	Christian Brauner <christian@brauner.io>,
	Kees Cook <keescook@chromium.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Giuseppe Scrivano <gscrivan@redhat.com>,
	Song Liu <songliubraving@fb.com>,
	Robert Sesek <rsesek@google.com>,
	Containers <containers@lists.linux-foundation.org>,
	linux-man <linux-man@vger.kernel.org>,
	lkml <linux-kernel@vger.kernel.org>,
	Aleksa Sarai <cyphar@cyphar.com>, Jann Horn <jannh@google.com>,
	Alexei Starovoitov <ast@kernel.org>,
	Will Drewry <wad@chromium.org>, bpf <bpf@vger.kernel.org>,
	Andy Lutomirski <luto@amacapital.net>
Subject: Re: For review: seccomp_user_notif(2) manual page [v2]
Date: Sat, 31 Oct 2020 17:27:51 +0100	[thread overview]
Message-ID: <606199d6-b48c-fee2-6e79-1e52bd7f429f@gmail.com> (raw)
In-Reply-To: <20201030202720.GA4088@ircssh-2.c.rugged-nimbus-611.internal>

Hello Sargun,

Thanks for your reply.

On 10/30/20 9:27 PM, Sargun Dhillon wrote:
> On Thu, Oct 29, 2020 at 09:37:21PM +0100, Michael Kerrisk (man-pages)
> wrote:

[...]

>>> I think I commented in another thread somewhere that the
>>> supervisor is not notified if the syscall is preempted. Therefore
>>> if it is performing a preemptible, long-running syscall, you need
>>> to poll SECCOMP_IOCTL_NOTIF_ID_VALID in the background, otherwise
>>> you can end up in a bad situation -- like leaking resources, or
>>> holding on to file descriptors after the program under
>>> supervision has intended to release them.
>> 
>> It's been a long day, and I'm not sure I reallu understand this. 
>> Could you outline the scnario in more detail?
>> 
> S: Sets up filter + interception for accept T: socket(AF_INET,
> SOCK_STREAM, 0) = 7 T: bind(7, {127.0.0.1, 4444}, ..) T: listen(7,
> 10) T: pidfd_getfd(T, 7) = 7 # For the sake of discussion.

Presumably, the preceding line should have been:

S: pidfd_getfd(T, 7) = 7 # For the sake of discussion.
(s/T:/S:/)

right?

> T: accept(7, ...) S: Intercepts accept S: Does accept in background 
> T: Receives signal, and accept(...) responds in EINTR T: close(7) S:
> Still running accept(7, ....), holding port 4444, so if now T
> retries to bind to port 4444, things fail.

Okay -- I understand. Presumably the solution here is not to 
block in accept(), but rather to use poll() to monitor both the
notification FD and the listening socket FD?

>>> A very specific example is if you're performing an accept on
>>> behalf of the program generating the notification, and the
>>> program intends to reuse the port. You can get into all sorts of
>>> awkward situations there.
>> 
>> [...]
>> 
> See above

[...]

>>> In addition, if it is a socket, it inherits the cgroup v1 classid
>>> and netprioidx of the receiving process.
>>> 
>>> The argument of this is as follows:
>>> 
>>> struct seccomp_notif_addfd { __u64 id; __u32 flags; __u32 srcfd; 
>>> __u32 newfd; __u32 newfd_flags; };
>>> 
>>> id This is the cookie value that was obtained using 
>>> SECCOMP_IOCTL_NOTIF_RECV.
>>> 
>>> flags A bitmask that includes zero or more of the 
>>> SECCOMP_ADDFD_FLAG_* bits set
>>> 
>>> SECCOMP_ADDFD_FLAG_SETFD - Use dup2 (or dup3?) like semantics
>>> when copying the file descriptor.
>>> 
>>> srcfd The file descriptor number to copy in the supervisor
>>> process.
>>> 
>>> newfd If the SECCOMP_ADDFD_FLAG_SETFD flag is specified this will
>>> be the file descriptor that is used in the dup2 semantics. If
>>> this file descriptor exists in the receiving process, it is
>>> closed and replaced by this file descriptor in an atomic fashion.
>>> If the copy process fails due to a MAC failure, or if srcfd is
>>> invalid, the newfd will not be closed in the receiving process.
>> 
>> Great description!
>> 
>>> If SECCOMP_ADDFD_FLAG_SETFD it not set, then this value must be
>>> 0.
>>> 
>>> newfd_flags The file descriptor flags to set on the file
>>> descriptor after it has been received by the process. The only
>>> flag that can currently be specified is O_CLOEXEC.
>>> 
>>> On success, this operation returns the file descriptor number in
>>> the receiving process. On failure, -1 is returned.
>>> 
>>> It can fail with the following error codes:
>>> 
>>> EINPROGRESS The cookie number specified hasn't been received by
>>> the listener
>> 
>> I don't understand this. Can you say more about the scenario?
>> 
> 
> This should not really happen. But if you do a ADDFD(...), on a
> notification *before* you've received it, you will get this error. So
> for example, 
> --> epoll(....) -> returns 
> --> RECV(...) cookie id is 777
> --> epoll(...) -> returns
> <-- ioctl(ADDFD, id = 778) # Notice how we haven't done a receive yet
> where we've received a notification for 778.

Got it. Looking also at the source code, I came up with the 
following:

              EINPROGRESS
                     The user-space notification specified in the id
                     field exists but has not yet been fetched (by a
                     SECCOMP_IOCTL_NOTIF_RECV) or has already been
                     responded to (by a SECCOMP_IOCTL_NOTIF_SEND).

Does that seem okay?

>>> ENOENT The cookie number is not valid. This can happen if a
>>> response has already been sent, or if the syscall was
>>> interrupted
>>> 
>>> EBADF If the file descriptor specified in srcfd is invalid, or if
>>> the fd is out of range of the destination program.
>> 
>> The piece "or if the fd is out of range of the destination program"
>> is not clear to me. Can you say some more please.
>> 
> 
> IIRC the maximum fd range is specific in proc by some sysctl named 
> nr_open. It's also evaluated against RLIMITs, and nr_max.
> 
> If nr-open (maximum fds open per process, iiirc) is 1000, even if 10
> FDs are open, it wont work if newfd is 1001.

Actually, the relevant limit seems to be just the RLIMIT_NOFILE
resource limit at least in my reading of fs/file.c::replace_fd().
So I made the text

              EBADF  Allocating the file descriptor in the target would
                     cause the target's RLIMIT_NOFILE limit to be
                     exceeded (see getrlimit(2)).

>>> EINVAL If flags or new_flags were unrecognized, or if newfd is
>>> non-zero, and SECCOMP_ADDFD_FLAG_SETFD has not been set.
>>> 
>>> EMFILE Too many files are open by the destination process.

I'm not sure that the error can really occur. That's the error
that in most other places occurs when RLIMIT_NOFILE is exceeded.
But I may have missed something. More precisely, when do you think
EMFILE can occur?

[...]

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/