linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/1] seccomp: Erroneous return on interrupted addfd ioctl()
@ 2021-04-13 16:01 Rodrigo Campos
  2021-04-13 16:01 ` [PATCH 1/1] seccomp: Always "goto wait" if the list is empty Rodrigo Campos
  0 siblings, 1 reply; 4+ messages in thread
From: Rodrigo Campos @ 2021-04-13 16:01 UTC (permalink / raw)
  To: Kees Cook, Andy Lutomirski, Will Drewry, linux-kernel, containers
  Cc: Rodrigo Campos, Sargun Dhillon, Mauricio Vásquez Bernal,
	Alban Crequy

Hi!

With Alban and Mauricio (on Cc), we are adding seccomp userspace notify support
to runc and found a kernel bug while testing it. The issue is the following: if
the addfd ioctl() in the seccomp agent is interrupted (like with SIGURG) then
the target process is erroneously and prematurely unblocked and a 0 is returned
to the target process.

This happens on all kernels that support addfd (>= v5.9). The problem is the
following, things should happen in this specific order to hit the bug.

The agent is written in go, so the runtime sends SIGURG quite often. If we are
interrupted when the addfd ioctl is in this wait[1]:
        ret = wait_for_completion_interruptible(&kaddfd.completion);

Then the "complete(&knotif->ready);" was just run. So, we proceed to take the
lock and delete the addfd element from the list[2]. The kaddfd.list is empty
now.

After that, the other side (the target process) will be woken up (because we
issued the "complete(&knotif->ready)") and take the lock, but the "if" to add
the fd will be false as the element is already deleted (the list is empty)[3].

Then, we don't execute the "goto wait" and we proceed to return to the other
side values set in n, that is 0 initialized. So, the other side sees 0 as a
response (error is 0 and val is 0).

The target process, then, sees a response even when we didn't answer the
notification, we just tried to add an fd.

I attach the strace output when hitting the bug. The target is issuing openat in
a loop, for "/dev/null-%d" with %d being the iteration we are at. The strace
shows that openat returned 0 for the iteration 27246.

The agent in go checks the notification is valid (SECCOMP_IOCTL_NOTIF_ID_VALID)
and then runs the addfd ioctl, that is interrupted by the go runtime. Then, it
is resumed and returns ENOENT. This makes sense, as the notification was already
responded (incorrectly) by the interrupted addfd ioctl, so when we run it again
now the notification doesn't exist.

There are two ways that come to mind to solve this problem. One is the patch in
the next email (tested, fixes the issue). The other one is the first patch in
the joint patchset that we sent with Sargun some days ago ("seccomp: Refactor
notification handler to prepare for new semantics")[4]. That patch alone also
fixes the issue.

Both patches are quite simple, although the one attached here doesn't change
semantics and the other ("seccomp: Refactor notification handler to prepare for
new semantics") does.

The patch that we apply should be cc stable too. This patch is based on
for-linus/seccomp, affect 5.9+ kernels and applies cleany to 5.10.y, 5.11.y and
current 5.12 (probably to others too, this part hasn't changed recently).

Rodrigo Campos (1):
  seccomp: Always "goto wait" if the list is empty

 kernel/seccomp.c | 20 +++++++++++++++-----
 1 file changed, 15 insertions(+), 5 deletions(-)


[1]: https://github.com/torvalds/linux/blob/89698becf06d341a700913c3d89ce2a914af69a2/kernel/seccomp.c#L1615
[2]: https://github.com/torvalds/linux/blob/89698becf06d341a700913c3d89ce2a914af69a2/kernel/seccomp.c#L1628-L1639 
[3]: https://github.com/torvalds/linux/blob/89698becf06d341a700913c3d89ce2a914af69a2/kernel/seccomp.c#L1113 
[4]: https://lore.kernel.org/lkml/20210318051733.2544-2-sargun@sargun.me/ 

Strace on the target:
...
[pid  3211] openat(-1, "/dev/null-27246", O_RDONLY) = 0

Strace on the agent (golang):
...
[pid  3214] pread64(10, "/dev/null-27246\0001\371\1Ag\177\0\0\0\0\0\0\1\0\0\0"..., 4096, 140724461935344) = 4096
[pid  3214] close(10)                   = 0
[pid  3214] ioctl(4, SECCOMP_IOCTL_NOTIF_ID_VALID, 0x7f7a1a7fbdc8) = 0
[pid  3214] write(2, "\33[37mDEBU\33[0m[0091] doing replac"..., 66 <unfinished ...>
[pid  3141] <... nanosleep resumed>NULL) = 0
[pid  3214] <... write resumed>)        = 66
[pid  3141] getpid( <unfinished ...>
[pid  3214] write(1, "using notify_fd: 4, srcfd: 8\n", 29 <unfinished ...>
[pid  3141] <... getpid resumed>)       = 3140
[pid  3214] <... write resumed>)        = 29
[pid  3141] tgkill(3140, 3214, SIGURG <unfinished ...>
[pid  3214] ioctl(4, SECCOMP_IOCTL_NOTIF_ADDFD <unfinished ...>
[pid  3141] <... tgkill resumed>)       = 0
[pid  3214] <... ioctl resumed>, 0x7f7a1a7fbde0) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
[pid  3141] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
[pid  3214] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=3140, si_uid=0} ---
[pid  3214] rt_sigreturn({mask=[]})     = 16
[pid  3214] ioctl(4, SECCOMP_IOCTL_NOTIF_ADDFD, 0x7f7a1a7fbde0) = -1 ENOENT (No such file or directory)

-- 
2.30.2


^ permalink raw reply	[flat|nested] 4+ messages in thread

* [PATCH 1/1] seccomp: Always "goto wait" if the list is empty
  2021-04-13 16:01 [PATCH 0/1] seccomp: Erroneous return on interrupted addfd ioctl() Rodrigo Campos
@ 2021-04-13 16:01 ` Rodrigo Campos
  2021-04-13 17:53   ` Christian Brauner
  0 siblings, 1 reply; 4+ messages in thread
From: Rodrigo Campos @ 2021-04-13 16:01 UTC (permalink / raw)
  To: Kees Cook, Andy Lutomirski, Will Drewry, linux-kernel, containers
  Cc: Rodrigo Campos, Sargun Dhillon, Mauricio Vásquez Bernal,
	Alban Crequy, stable

It is possible for the thread with the seccomp filter attached (target)
to be waken up by an addfd message, but the list be empty. This happens
when the addfd ioctl on the other side (seccomp agent) is interrupted by
a signal such as SIGURG. In that case, the target erroneously and
prematurely returns from the syscall to userspace even though the
seccomp agent didn't ask for it.

This happens in the following scenario:

seccomp_notify_addfd()                                           | seccomp_do_user_notification()
                                                                 |
                                                                 |  err = wait_for_completion_interruptible(&n.ready);
 complete(&knotif->ready);                                       |
 ret = wait_for_completion_interruptible(&kaddfd.completion);    |
 // interrupted                                                  |
                                                                 |
 mutex_lock(&filter->notify_lock);                               |
 list_del(&kaddfd.list);                                         |
 mutex_unlock(&filter->notify_lock);                             |
                                                                 |  mutex_lock(&match->notify_lock);
                                                                 |  // This is false, addfd is false
                                                                 |  if (addfd && n.state != SECCOMP_NOTIFY_REPLIED)
                                                                 |
                                                                 |  ret = n.val;
                                                                 |  err = n.error;
                                                                 |  flags = n.flags;

So, the process blocked in seccomp_do_user_notification() will see a
response. As n is 0 initialized and wasn't set, it will see a 0 as
return value from the syscall.

The seccomp agent, when retrying the interrupted syscall, will see an
ENOENT error as the notification no longer exists (it was already
answered by this bug).

This patch fixes the issue by splitting the if in two parts: if we were
woken up and the state is not replied, we will always do a "goto wait".
And if that happens and there is an addfd element on the list, we will
add the fd before "goto wait".

This issue is present since 5.9, when addfd was added.

Fixes: 7cf97b1254550
Cc: stable@vger.kernel.org # 5.9+
Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
---
 kernel/seccomp.c | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 63b40d12896b..1b34598f0e07 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -1107,11 +1107,20 @@ static int seccomp_do_user_notification(int this_syscall,
 	err = wait_for_completion_interruptible(&n.ready);
 	mutex_lock(&match->notify_lock);
 	if (err == 0) {
-		/* Check if we were woken up by a addfd message */
-		addfd = list_first_entry_or_null(&n.addfd,
-						 struct seccomp_kaddfd, list);
-		if (addfd && n.state != SECCOMP_NOTIFY_REPLIED) {
-			seccomp_handle_addfd(addfd);
+
+		if (n.state != SECCOMP_NOTIFY_REPLIED) {
+			/*
+			 * It is possible to be waken-up by an addfd message but
+			 * the list be empty. This can happen if the addfd
+			 * ioctl() is interrupted, as it deletes the element.
+			 *
+			 * So, check if indeed there is an element in the list.
+			 */
+			addfd = list_first_entry_or_null(&n.addfd,
+							 struct seccomp_kaddfd, list);
+			if (addfd)
+				seccomp_handle_addfd(addfd);
+
 			mutex_unlock(&match->notify_lock);
 			goto wait;
 		}
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH 1/1] seccomp: Always "goto wait" if the list is empty
  2021-04-13 16:01 ` [PATCH 1/1] seccomp: Always "goto wait" if the list is empty Rodrigo Campos
@ 2021-04-13 17:53   ` Christian Brauner
  2021-04-13 18:02     ` Rodrigo Campos
  0 siblings, 1 reply; 4+ messages in thread
From: Christian Brauner @ 2021-04-13 17:53 UTC (permalink / raw)
  To: Rodrigo Campos
  Cc: Kees Cook, Andy Lutomirski, Will Drewry, linux-kernel,
	containers, Sargun Dhillon, Mauricio Vásquez Bernal,
	Alban Crequy, stable

On Tue, Apr 13, 2021 at 06:01:51PM +0200, Rodrigo Campos wrote:
> It is possible for the thread with the seccomp filter attached (target)
> to be waken up by an addfd message, but the list be empty. This happens
> when the addfd ioctl on the other side (seccomp agent) is interrupted by
> a signal such as SIGURG. In that case, the target erroneously and
> prematurely returns from the syscall to userspace even though the
> seccomp agent didn't ask for it.
> 
> This happens in the following scenario:
> 
> seccomp_notify_addfd()                                           | seccomp_do_user_notification()
>                                                                  |
>                                                                  |  err = wait_for_completion_interruptible(&n.ready);
>  complete(&knotif->ready);                                       |
>  ret = wait_for_completion_interruptible(&kaddfd.completion);    |
>  // interrupted                                                  |
>                                                                  |
>  mutex_lock(&filter->notify_lock);                               |
>  list_del(&kaddfd.list);                                         |
>  mutex_unlock(&filter->notify_lock);                             |
>                                                                  |  mutex_lock(&match->notify_lock);
>                                                                  |  // This is false, addfd is false
>                                                                  |  if (addfd && n.state != SECCOMP_NOTIFY_REPLIED)
>                                                                  |
>                                                                  |  ret = n.val;
>                                                                  |  err = n.error;
>                                                                  |  flags = n.flags;
> 
> So, the process blocked in seccomp_do_user_notification() will see a
> response. As n is 0 initialized and wasn't set, it will see a 0 as
> return value from the syscall.
> 
> The seccomp agent, when retrying the interrupted syscall, will see an
> ENOENT error as the notification no longer exists (it was already
> answered by this bug).
> 
> This patch fixes the issue by splitting the if in two parts: if we were
> woken up and the state is not replied, we will always do a "goto wait".
> And if that happens and there is an addfd element on the list, we will
> add the fd before "goto wait".
> 
> This issue is present since 5.9, when addfd was added.
> 
> Fixes: 7cf97b1254550
> Cc: stable@vger.kernel.org # 5.9+
> Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
> ---

So the agent will see the return value from
wait_for_completion_interruptible() and know that the addfd wasn't
successful and the target will notice that no addfd request has actually
been added and essentially try again. Seems like a decent fix and can be
backported cleanly. I assume seccomp testsuite passes.

Acked-by: Christian Brauner <christian.brauner@ubuntu.com>

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH 1/1] seccomp: Always "goto wait" if the list is empty
  2021-04-13 17:53   ` Christian Brauner
@ 2021-04-13 18:02     ` Rodrigo Campos
  0 siblings, 0 replies; 4+ messages in thread
From: Rodrigo Campos @ 2021-04-13 18:02 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Kees Cook, Andy Lutomirski, Will Drewry, LKML, Linux Containers,
	Sargun Dhillon, Mauricio Vásquez Bernal, Alban Crequy,
	stable

On Tue, Apr 13, 2021 at 7:54 PM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
> > Fixes: 7cf97b1254550
> > Cc: stable@vger.kernel.org # 5.9+
> > Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
> > ---
>
> So the agent will see the return value from
> wait_for_completion_interruptible() and know that the addfd wasn't
> successful and the target will notice that no addfd request has actually
> been added and essentially try again. Seems like a decent fix and can be

Yes, exactly!

> backported cleanly. I assume seccomp testsuite passes.

Yes, seccomp selftests (tools/testing/selftests/seccomp/seccomp_bpf) passes fine

> Acked-by: Christian Brauner <christian.brauner@ubuntu.com>

Thanks!

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2021-04-13 18:03 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-13 16:01 [PATCH 0/1] seccomp: Erroneous return on interrupted addfd ioctl() Rodrigo Campos
2021-04-13 16:01 ` [PATCH 1/1] seccomp: Always "goto wait" if the list is empty Rodrigo Campos
2021-04-13 17:53   ` Christian Brauner
2021-04-13 18:02     ` Rodrigo Campos

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).