linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* SECCOMP_IOCTL_NOTIF_ADDFD race condition
@ 2020-11-26 13:09 Alban Crequy
  2020-11-30 23:20 ` Tycho Andersen
  0 siblings, 1 reply; 8+ messages in thread
From: Alban Crequy @ 2020-11-26 13:09 UTC (permalink / raw)
  To: LKML, Linux Containers
  Cc: Sargun Dhillon, Kees Cook, Giuseppe Scrivano, Christian Brauner,
	Tycho Andersen, Rodrigo Campos, Mauricio Vásquez Bernal

Hi,

With the addfd feature (added in “seccomp: Introduce addfd ioctl to
seccomp user notifier”, commit 7cf97b125455), the new file is
installed in the target process during the SECCOMP_IOCTL_NOTIF_ADDFD
operation and not at the end with the SECCOMP_IOCTL_NOTIF_SEND
operation. This can cause race conditions when the target process is
interrupted by a signal (EINTR) and restarted automatically.

This is more noticeable in multithreaded processes like with Golang.
In Golang 1.14:
https://golang.org/doc/go1.14
> "A consequence of the implementation of preemption is that on Unix systems, including Linux and macOS systems, programs built with Go 1.14 will receive more signals than programs built with earlier releases. This means that programs that use packages like syscall or golang.org/x/sys/unix will see more slow system calls fail with EINTR errors. Those programs will have to handle those errors in some way, most likely looping to try the system call again."

In my test, I added a seccomp policy which returns
SECCOMP_RET_USER_NOTIF on execve() and I added a sleep(2) in the
seccomp agent (using https://github.com/kinvolk/seccompagent/) between
SECCOMP_IOCTL_NOTIF_RECV and SECCOMP_IOCTL_NOTIF_SEND to make it a bit
slow to reply with SECCOMP_USER_NOTIF_FLAG_CONTINUE. I got the
following strace log going on in a loop:

[pid 2656199] execve("/bin/sh", ["sh", "-c", "sleep infinity"],
0xc000063b00 /* 11 vars */ <unfinished ...>
[pid 2656200] <... nanosleep resumed>NULL) = 0
[pid 2656200] epoll_pwait(7, [], 128, 0, NULL, 0) = 0
[pid 2656200] getpid()                  = 1
[pid 2656200] tgkill(1, 1, SIGURG)      = 0
[pid 2656199] <... execve resumed>)     = ? ERESTARTSYS (To be
restarted if SA_RESTART is set)
[pid 2656200] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 2656199] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=1,
si_uid=0} ---
[pid 2656199] rt_sigreturn({mask=[]})   = 59
[pid 2656199] execve("/bin/sh", ["sh", "-c", "sleep infinity"],
0xc000063b00 /* 11 vars */ <unfinished ...>

On the seccomp agent side, the ioctl(SECCOMP_IOCTL_NOTIF_SEND) returns
ENOENT, and then it receives the same notification at the next
iteration of the loop.

The SIGURG signal is sent by the Golang runtime, causing the execve to
be interrupted, and restarted automatically, triggering the new
seccomp notification. In this example with execve, this is not a big
deal because the seccomp agent doesn't add a fd. But on a open() or
accept() syscall, I fear that the seccomp agent could install a file
descriptor without knowing that the syscall will be interrupted soon
after, but before the SECCOMP_IOCTL_NOTIF_SEND is completed.

I understand the need to have two different ioctl() to add the fd and
to reply to the seccomp notification because the seccomp agent needs
to know the fd number being assigned before specifying the return
value of the syscall with that number.

What do you think is the best way to solve this problem? Here are a few ideas:

- Idea 1: add a second flag for the struct seccomp_notif_resp
“SECCOMP_USER_NOTIF_FLAG_RETURN_FD” to instruct seccomp to override
the return value with the first fd to install. It would not help to
emulate recvfrom() with SCM_RIGHTS but it will solve the problem for
syscalls that return a fd because we can then implement a new ioctl
(“SECCOMP_IOCTL_NOTIF_SEND_WITH_FDS”?) that does the addfd and the
notification response in one step.

Other ideas but they cause more problems:

- Idea 2: We need some kind of transactions where the fd is sent with
the first ioctl() and installed in the fd table but marked somehow to
be closed automatically if the syscall is interrupted with EINTR
outside of the control of the seccomp agent. The new fd in the fd
table would be committed at the end if the syscall is not interrupted.
But this introduces other issues: another thread could call dup() on
the fd before it gets closed. Or another process sharing the fd table
with CLONE_FILES could do the same. Should the not-yet-committed fds
be visible in /proc/<pid>/fd/? Or inherited to new processes created
by fork()?

- Idea 3: We could add fds in a temporary location but not in the
`struct files_struct` of the target process, and only commit at
SECCOMP_IOCTL_NOTIF_SEND time. In this way, threads or processes
sharing the fd table with CLONE_FILES would not be impacted. However,
this could open new race conditions if other threads are installing
fds in the same slots in the fd table. Also, this seems quite
dangerous to add this concept of "inflight" fd for seccomp because
there are already inflight fds for SCM_RIGHT and a garbage collector
to clean circular references (net/unix/garbage.c). If we add an
inflight fd mechanism on seccomp, a malicious user could just use
SECCOMP_IOCTL_NOTIF_ADDFD to send a unix socket that has the
seccomp-fd inflight in SCM_RIGHT. Then, the malicious seccomp agent
would close(seccompFd) and we will be in a situation where both the
seccomp-fd and the unix socket are not attached to any process but
they reference each other, so they cannot be closed.

What do you think? Is there a better solution?

Cheers
Alban

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: SECCOMP_IOCTL_NOTIF_ADDFD race condition
  2020-11-26 13:09 SECCOMP_IOCTL_NOTIF_ADDFD race condition Alban Crequy
@ 2020-11-30 23:20 ` Tycho Andersen
  2020-12-01  4:08   ` Sargun Dhillon
  2020-12-01 12:41   ` Tycho Andersen
  0 siblings, 2 replies; 8+ messages in thread
From: Tycho Andersen @ 2020-11-30 23:20 UTC (permalink / raw)
  To: Alban Crequy
  Cc: LKML, Linux Containers, Sargun Dhillon, Kees Cook,
	Giuseppe Scrivano, Christian Brauner, Rodrigo Campos,
	Mauricio Vásquez Bernal

Hi,

On Thu, Nov 26, 2020 at 02:09:33PM +0100, Alban Crequy wrote:
> Hi,
> 
> With the addfd feature (added in “seccomp: Introduce addfd ioctl to
> seccomp user notifier”, commit 7cf97b125455), the new file is
> installed in the target process during the SECCOMP_IOCTL_NOTIF_ADDFD
> operation and not at the end with the SECCOMP_IOCTL_NOTIF_SEND
> operation. This can cause race conditions when the target process is
> interrupted by a signal (EINTR) and restarted automatically.
> 
> This is more noticeable in multithreaded processes like with Golang.
> In Golang 1.14:
> https://golang.org/doc/go1.14
> > "A consequence of the implementation of preemption is that on Unix systems, including Linux and macOS systems, programs built with Go 1.14 will receive more signals than programs built with earlier releases. This means that programs that use packages like syscall or golang.org/x/sys/unix will see more slow system calls fail with EINTR errors. Those programs will have to handle those errors in some way, most likely looping to try the system call again."
> 
> In my test, I added a seccomp policy which returns
> SECCOMP_RET_USER_NOTIF on execve() and I added a sleep(2) in the
> seccomp agent (using https://github.com/kinvolk/seccompagent/) between
> SECCOMP_IOCTL_NOTIF_RECV and SECCOMP_IOCTL_NOTIF_SEND to make it a bit
> slow to reply with SECCOMP_USER_NOTIF_FLAG_CONTINUE. I got the
> following strace log going on in a loop:
> 
> [pid 2656199] execve("/bin/sh", ["sh", "-c", "sleep infinity"],
> 0xc000063b00 /* 11 vars */ <unfinished ...>
> [pid 2656200] <... nanosleep resumed>NULL) = 0
> [pid 2656200] epoll_pwait(7, [], 128, 0, NULL, 0) = 0
> [pid 2656200] getpid()                  = 1
> [pid 2656200] tgkill(1, 1, SIGURG)      = 0
> [pid 2656199] <... execve resumed>)     = ? ERESTARTSYS (To be
> restarted if SA_RESTART is set)
> [pid 2656200] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
> [pid 2656199] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=1,
> si_uid=0} ---
> [pid 2656199] rt_sigreturn({mask=[]})   = 59
> [pid 2656199] execve("/bin/sh", ["sh", "-c", "sleep infinity"],
> 0xc000063b00 /* 11 vars */ <unfinished ...>
> 
> On the seccomp agent side, the ioctl(SECCOMP_IOCTL_NOTIF_SEND) returns
> ENOENT, and then it receives the same notification at the next
> iteration of the loop.
> 
> The SIGURG signal is sent by the Golang runtime, causing the execve to
> be interrupted, and restarted automatically, triggering the new
> seccomp notification. In this example with execve, this is not a big
> deal because the seccomp agent doesn't add a fd. But on a open() or
> accept() syscall, I fear that the seccomp agent could install a file
> descriptor without knowing that the syscall will be interrupted soon
> after, but before the SECCOMP_IOCTL_NOTIF_SEND is completed.
> 
> I understand the need to have two different ioctl() to add the fd and
> to reply to the seccomp notification because the seccomp agent needs
> to know the fd number being assigned before specifying the return
> value of the syscall with that number.
> 
> What do you think is the best way to solve this problem? Here are a few ideas:
> 
> - Idea 1: add a second flag for the struct seccomp_notif_resp
> “SECCOMP_USER_NOTIF_FLAG_RETURN_FD” to instruct seccomp to override
> the return value with the first fd to install. It would not help to
> emulate recvfrom() with SCM_RIGHTS but it will solve the problem for
> syscalls that return a fd because we can then implement a new ioctl
> (“SECCOMP_IOCTL_NOTIF_SEND_WITH_FDS”?) that does the addfd and the
> notification response in one step.
> 
> Other ideas but they cause more problems:
> 
> - Idea 2: We need some kind of transactions where the fd is sent with
> the first ioctl() and installed in the fd table but marked somehow to
> be closed automatically if the syscall is interrupted with EINTR
> outside of the control of the seccomp agent. The new fd in the fd
> table would be committed at the end if the syscall is not interrupted.
> But this introduces other issues: another thread could call dup() on
> the fd before it gets closed. Or another process sharing the fd table
> with CLONE_FILES could do the same. Should the not-yet-committed fds
> be visible in /proc/<pid>/fd/? Or inherited to new processes created
> by fork()?
> 
> - Idea 3: We could add fds in a temporary location but not in the
> `struct files_struct` of the target process, and only commit at
> SECCOMP_IOCTL_NOTIF_SEND time. In this way, threads or processes
> sharing the fd table with CLONE_FILES would not be impacted. However,
> this could open new race conditions if other threads are installing
> fds in the same slots in the fd table. Also, this seems quite
> dangerous to add this concept of "inflight" fd for seccomp because
> there are already inflight fds for SCM_RIGHT and a garbage collector
> to clean circular references (net/unix/garbage.c). If we add an
> inflight fd mechanism on seccomp, a malicious user could just use
> SECCOMP_IOCTL_NOTIF_ADDFD to send a unix socket that has the
> seccomp-fd inflight in SCM_RIGHT. Then, the malicious seccomp agent
> would close(seccompFd) and we will be in a situation where both the
> seccomp-fd and the unix socket are not attached to any process but
> they reference each other, so they cannot be closed.
> 
> What do you think? Is there a better solution?

Idea 1 sounds best to me, but maybe that's because it's the way I
originally did the fd support that never landed :)

But here's an Idea 4: we add a way to remotely close an fd (I don't
see that the current infra can do this, but perhaps I didn't look hard
enough), and then when you get ENOENT you have to close the fd. Of
course, this can't be via seccomp, so maybe it's even more racy.

Tycho

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: SECCOMP_IOCTL_NOTIF_ADDFD race condition
  2020-11-30 23:20 ` Tycho Andersen
@ 2020-12-01  4:08   ` Sargun Dhillon
  2020-12-01 12:41   ` Tycho Andersen
  1 sibling, 0 replies; 8+ messages in thread
From: Sargun Dhillon @ 2020-12-01  4:08 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Alban Crequy, LKML, Linux Containers, Kees Cook,
	Giuseppe Scrivano, Christian Brauner, Rodrigo Campos,
	Mauricio Vásquez Bernal

On Mon, Nov 30, 2020 at 06:20:09PM -0500, Tycho Andersen wrote:
> Hi,
> 
> On Thu, Nov 26, 2020 at 02:09:33PM +0100, Alban Crequy wrote:
> > Hi,
> > 
> > With the addfd feature (added in “seccomp: Introduce addfd ioctl to
> > seccomp user notifier”, commit 7cf97b125455), the new file is
> > installed in the target process during the SECCOMP_IOCTL_NOTIF_ADDFD
> > operation and not at the end with the SECCOMP_IOCTL_NOTIF_SEND
> > operation. This can cause race conditions when the target process is
> > interrupted by a signal (EINTR) and restarted automatically.
> > 
> > This is more noticeable in multithreaded processes like with Golang.
> > In Golang 1.14:
> > https://golang.org/doc/go1.14
> > > "A consequence of the implementation of preemption is that on Unix systems, including Linux and macOS systems, programs built with Go 1.14 will receive more signals than programs built with earlier releases. This means that programs that use packages like syscall or golang.org/x/sys/unix will see more slow system calls fail with EINTR errors. Those programs will have to handle those errors in some way, most likely looping to try the system call again."
> > 
> > In my test, I added a seccomp policy which returns
> > SECCOMP_RET_USER_NOTIF on execve() and I added a sleep(2) in the
> > seccomp agent (using https://github.com/kinvolk/seccompagent/) between
> > SECCOMP_IOCTL_NOTIF_RECV and SECCOMP_IOCTL_NOTIF_SEND to make it a bit
> > slow to reply with SECCOMP_USER_NOTIF_FLAG_CONTINUE. I got the
> > following strace log going on in a loop:
> > 
> > [pid 2656199] execve("/bin/sh", ["sh", "-c", "sleep infinity"],
> > 0xc000063b00 /* 11 vars */ <unfinished ...>
> > [pid 2656200] <... nanosleep resumed>NULL) = 0
> > [pid 2656200] epoll_pwait(7, [], 128, 0, NULL, 0) = 0
> > [pid 2656200] getpid()                  = 1
> > [pid 2656200] tgkill(1, 1, SIGURG)      = 0
> > [pid 2656199] <... execve resumed>)     = ? ERESTARTSYS (To be
> > restarted if SA_RESTART is set)
> > [pid 2656200] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
> > [pid 2656199] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=1,
> > si_uid=0} ---
> > [pid 2656199] rt_sigreturn({mask=[]})   = 59
> > [pid 2656199] execve("/bin/sh", ["sh", "-c", "sleep infinity"],
> > 0xc000063b00 /* 11 vars */ <unfinished ...>
> > 
> > On the seccomp agent side, the ioctl(SECCOMP_IOCTL_NOTIF_SEND) returns
> > ENOENT, and then it receives the same notification at the next
> > iteration of the loop.
> > 
> > The SIGURG signal is sent by the Golang runtime, causing the execve to
> > be interrupted, and restarted automatically, triggering the new
> > seccomp notification. In this example with execve, this is not a big
> > deal because the seccomp agent doesn't add a fd. But on a open() or
> > accept() syscall, I fear that the seccomp agent could install a file
> > descriptor without knowing that the syscall will be interrupted soon
> > after, but before the SECCOMP_IOCTL_NOTIF_SEND is completed.
> > 
> > I understand the need to have two different ioctl() to add the fd and
> > to reply to the seccomp notification because the seccomp agent needs
> > to know the fd number being assigned before specifying the return
> > value of the syscall with that number.
> > 
> > What do you think is the best way to solve this problem? Here are a few ideas:
> > 
> > - Idea 1: add a second flag for the struct seccomp_notif_resp
> > “SECCOMP_USER_NOTIF_FLAG_RETURN_FD” to instruct seccomp to override
> > the return value with the first fd to install. It would not help to
> > emulate recvfrom() with SCM_RIGHTS but it will solve the problem for
> > syscalls that return a fd because we can then implement a new ioctl
> > (“SECCOMP_IOCTL_NOTIF_SEND_WITH_FDS”?) that does the addfd and the
> > notification response in one step.
> > 
> > Other ideas but they cause more problems:
> > 
> > - Idea 2: We need some kind of transactions where the fd is sent with
> > the first ioctl() and installed in the fd table but marked somehow to
> > be closed automatically if the syscall is interrupted with EINTR
> > outside of the control of the seccomp agent. The new fd in the fd
> > table would be committed at the end if the syscall is not interrupted.
> > But this introduces other issues: another thread could call dup() on
> > the fd before it gets closed. Or another process sharing the fd table
> > with CLONE_FILES could do the same. Should the not-yet-committed fds
> > be visible in /proc/<pid>/fd/? Or inherited to new processes created
> > by fork()?
> > 
> > - Idea 3: We could add fds in a temporary location but not in the
> > `struct files_struct` of the target process, and only commit at
> > SECCOMP_IOCTL_NOTIF_SEND time. In this way, threads or processes
> > sharing the fd table with CLONE_FILES would not be impacted. However,
> > this could open new race conditions if other threads are installing
> > fds in the same slots in the fd table. Also, this seems quite
> > dangerous to add this concept of "inflight" fd for seccomp because
> > there are already inflight fds for SCM_RIGHT and a garbage collector
> > to clean circular references (net/unix/garbage.c). If we add an
> > inflight fd mechanism on seccomp, a malicious user could just use
> > SECCOMP_IOCTL_NOTIF_ADDFD to send a unix socket that has the
> > seccomp-fd inflight in SCM_RIGHT. Then, the malicious seccomp agent
> > would close(seccompFd) and we will be in a situation where both the
> > seccomp-fd and the unix socket are not attached to any process but
> > they reference each other, so they cannot be closed.
> > 
> > What do you think? Is there a better solution?
> 
> Idea 1 sounds best to me, but maybe that's because it's the way I
> originally did the fd support that never landed :)
> 
> But here's an Idea 4: we add a way to remotely close an fd (I don't
> see that the current infra can do this, but perhaps I didn't look hard
> enough), and then when you get ENOENT you have to close the fd. Of
> course, this can't be via seccomp, so maybe it's even more racy.
> 
> Tycho

I've actually been looking into this problem a little bit more since
Alban pointed it out.

It actually seems like the problem has been made significant worse
recently since Golang introduced asynchronous preemption:

https://medium.com/a-journey-with-go/go-asynchronous-preemption-b5194227371c

> The asynchronous preemption is triggered based on a time condition. When a 
> goroutine is running for more than 10ms, Go will try to preempt it.

Unfortunately, this means that if the "kernel" (or our seccomp manager
processs) takes more than 10ms to handle the request, the request
will likely be EINTR'd, and this will trigger a preemption loop.

This in turn can actually lead to a forever race, or a massive performance
regression, as the seccomp notifier process can take ~10ms easily.

Tracking preemptions / "resumptions" is a pain.

I've been waiting to comment as Eric's patches are in progress:
https://lore.kernel.org/lkml/CAHk-=wge0oJ3fbmNfVek101CO7hg1UfUHnBgxLB3Jmq6-hWLug@mail.gmail.com/

I think that either allocating something in the fdtable and marking it as to be 
allocated in the near future is the best option, the reason is it solves the 
problem of handling interrupted syscalls even if you're adding more than one FD.

The other idea, and to help avoid the messiness of the Golang problem is being 
able to block signals and mark the child as ignoring some set of signals from
the manager.

I'm curious if Al, or Christian knows why allocation and usage requiring a
userspace ping-pong might not be good.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: SECCOMP_IOCTL_NOTIF_ADDFD race condition
  2020-11-30 23:20 ` Tycho Andersen
  2020-12-01  4:08   ` Sargun Dhillon
@ 2020-12-01 12:41   ` Tycho Andersen
  2020-12-01 13:08     ` Sargun Dhillon
  1 sibling, 1 reply; 8+ messages in thread
From: Tycho Andersen @ 2020-12-01 12:41 UTC (permalink / raw)
  To: Alban Crequy; +Cc: Giuseppe Scrivano, Kees Cook, Linux Containers, LKML

On Mon, Nov 30, 2020 at 06:20:09PM -0500, Tycho Andersen wrote:
> Idea 1 sounds best to me, but maybe that's because it's the way I
> originally did the fd support that never landed :)
> 
> But here's an Idea 4: we add a way to remotely close an fd (I don't
> see that the current infra can do this, but perhaps I didn't look hard
> enough), and then when you get ENOENT you have to close the fd. Of
> course, this can't be via seccomp, so maybe it's even more racy.

Or better yet: what if the kernel closed everything it had added via
ADDFD if it didn't get a valid response from the supervisor? Then
everyone gets this bug fixed for free.

Tycho

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: SECCOMP_IOCTL_NOTIF_ADDFD race condition
  2020-12-01 12:41   ` Tycho Andersen
@ 2020-12-01 13:08     ` Sargun Dhillon
  2020-12-01 13:13       ` Tycho Andersen
  0 siblings, 1 reply; 8+ messages in thread
From: Sargun Dhillon @ 2020-12-01 13:08 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Alban Crequy, Giuseppe Scrivano, Linux Containers, Kees Cook, LKML

On Tue, Dec 01, 2020 at 07:41:05AM -0500, Tycho Andersen wrote:
> On Mon, Nov 30, 2020 at 06:20:09PM -0500, Tycho Andersen wrote:
> > Idea 1 sounds best to me, but maybe that's because it's the way I
> > originally did the fd support that never landed :)
> > 
> > But here's an Idea 4: we add a way to remotely close an fd (I don't
> > see that the current infra can do this, but perhaps I didn't look hard
> > enough), and then when you get ENOENT you have to close the fd. Of
> > course, this can't be via seccomp, so maybe it's even more racy.
> 
> Or better yet: what if the kernel closed everything it had added via
> ADDFD if it didn't get a valid response from the supervisor? Then
> everyone gets this bug fixed for free.
> 
> Tycho
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers

This doesn't solve the problem universally because of the (Go) preemption 
problem. Unless we can guarantee that the supervisor can always handle the 
request in fewer than 10ms, or if it implements resumption behaviour. I know 
that resumption behaviour is a requirement no matter what, but the easier we can 
make it to implement resumption, the better chance we are giving users to get 
this right.

I think that the easiest solution is to add to the SECCOMP_IOCTL_NOTIF_RECV
ioctl. Either we have a flag like "block all blockable signals" or pass a
sigset_t directly of which signals to allow (and return an einval if they
try to block non-blockable signals).



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: SECCOMP_IOCTL_NOTIF_ADDFD race condition
  2020-12-01 13:08     ` Sargun Dhillon
@ 2020-12-01 13:13       ` Tycho Andersen
  2020-12-01 21:27         ` Kees Cook
  0 siblings, 1 reply; 8+ messages in thread
From: Tycho Andersen @ 2020-12-01 13:13 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: Alban Crequy, Giuseppe Scrivano, Linux Containers, Kees Cook, LKML

On Tue, Dec 01, 2020 at 01:08:25PM +0000, Sargun Dhillon wrote:
> On Tue, Dec 01, 2020 at 07:41:05AM -0500, Tycho Andersen wrote:
> > On Mon, Nov 30, 2020 at 06:20:09PM -0500, Tycho Andersen wrote:
> > > Idea 1 sounds best to me, but maybe that's because it's the way I
> > > originally did the fd support that never landed :)
> > > 
> > > But here's an Idea 4: we add a way to remotely close an fd (I don't
> > > see that the current infra can do this, but perhaps I didn't look hard
> > > enough), and then when you get ENOENT you have to close the fd. Of
> > > course, this can't be via seccomp, so maybe it's even more racy.
> > 
> > Or better yet: what if the kernel closed everything it had added via
> > ADDFD if it didn't get a valid response from the supervisor? Then
> > everyone gets this bug fixed for free.
> > 
> > Tycho
> > _______________________________________________
> > Containers mailing list
> > Containers@lists.linux-foundation.org
> > https://lists.linuxfoundation.org/mailman/listinfo/containers
> 
> This doesn't solve the problem universally because of the (Go) preemption 
> problem. Unless we can guarantee that the supervisor can always handle the 
> request in fewer than 10ms, or if it implements resumption behaviour. I know 
> that resumption behaviour is a requirement no matter what, but the easier we can 
> make it to implement resumption, the better chance we are giving users to get 
> this right.

Doesn't automatic cleanup of fds make things easier? I'm not sure I
understand the argument.

I agree it doesn't fix the problem of uncooperative userspace.

Tycho

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: SECCOMP_IOCTL_NOTIF_ADDFD race condition
  2020-12-01 13:13       ` Tycho Andersen
@ 2020-12-01 21:27         ` Kees Cook
  0 siblings, 0 replies; 8+ messages in thread
From: Kees Cook @ 2020-12-01 21:27 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Sargun Dhillon, Alban Crequy, Giuseppe Scrivano,
	Linux Containers, LKML, jannh

On Tue, Dec 01, 2020 at 08:13:34AM -0500, Tycho Andersen wrote:
> On Tue, Dec 01, 2020 at 01:08:25PM +0000, Sargun Dhillon wrote:
> > On Tue, Dec 01, 2020 at 07:41:05AM -0500, Tycho Andersen wrote:
> > > On Mon, Nov 30, 2020 at 06:20:09PM -0500, Tycho Andersen wrote:
> > > > Idea 1 sounds best to me, but maybe that's because it's the way I
> > > > originally did the fd support that never landed :)
> > > > 
> > > > But here's an Idea 4: we add a way to remotely close an fd (I don't
> > > > see that the current infra can do this, but perhaps I didn't look hard
> > > > enough), and then when you get ENOENT you have to close the fd. Of
> > > > course, this can't be via seccomp, so maybe it's even more racy.
> > > 
> > > Or better yet: what if the kernel closed everything it had added via
> > > ADDFD if it didn't get a valid response from the supervisor? Then
> > > everyone gets this bug fixed for free.
> > > 
> > > Tycho
> > > _______________________________________________
> > > Containers mailing list
> > > Containers@lists.linux-foundation.org
> > > https://lists.linuxfoundation.org/mailman/listinfo/containers
> > 
> > This doesn't solve the problem universally because of the (Go) preemption 
> > problem. Unless we can guarantee that the supervisor can always handle the 
> > request in fewer than 10ms, or if it implements resumption behaviour. I know 
> > that resumption behaviour is a requirement no matter what, but the easier we can 
> > make it to implement resumption, the better chance we are giving users to get 
> > this right.
> 
> Doesn't automatic cleanup of fds make things easier? I'm not sure I
> understand the argument.

I doubt Al would ever allow the "cleanup" approach: his observation was
that the instant a file has been added to the fdtable, it's not possible
to "unwind" that ever, since it could be cloned away, etc, etc.

> I agree it doesn't fix the problem of uncooperative userspace.

IIUC, I see two issues:

- a slow monitor might cause a child to loop forever retrying the same
  interrupted syscall.

- a syscall-interrupted process may have had an fd added that it has no
  idea about.

The former problem seems like a userspace issue. :P But, to help, yeah, is
signal blocking best? Either explicit (at filter apply time) or implicit
(all user_notif-triggering syscalls get all signals blocks automatically)?

For the latter problem, I think we need to get back to Tycho's original
method: add fd and finish syscall in a single action. I can't see any
other way to get around the need for atomicity...

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: SECCOMP_IOCTL_NOTIF_ADDFD race condition
@ 2022-07-19  2:13 Robin Naccari
  0 siblings, 0 replies; 8+ messages in thread
From: Robin Naccari @ 2022-07-19  2:13 UTC (permalink / raw)
  To: Robin Coley
  Cc: alban, containers, gscrivan, jannh, linux-kernel, sargun, tycho




Sent from my iPad

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2022-07-19  2:13 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-26 13:09 SECCOMP_IOCTL_NOTIF_ADDFD race condition Alban Crequy
2020-11-30 23:20 ` Tycho Andersen
2020-12-01  4:08   ` Sargun Dhillon
2020-12-01 12:41   ` Tycho Andersen
2020-12-01 13:08     ` Sargun Dhillon
2020-12-01 13:13       ` Tycho Andersen
2020-12-01 21:27         ` Kees Cook
2022-07-19  2:13 Robin Naccari

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).