From mboxrd@z Thu Jan 1 00:00:00 1970
X-Mailing-List: containers@lists.linux.dev
From: Sargun Dhillon
Date: Mon, 24 May 2021 11:55:29 -0700
Subject: Re: [RFC PATCH bpf-next seccomp 00/12] eBPF seccomp filters
To: Tianyin Xu
Cc: Andy Lutomirski, YiFei Zhu, "containers@lists.linux.dev", bpf,
 "Zhu, YiFei", LSM List, Alexei Starovoitov, Andrea Arcangeli,
 "Kuo, Hsuan-Chi", Claudio Canella, Daniel Borkmann, Daniel Gruss,
 Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jann Horn,
 "Jia, Jinghao", "Torrellas, Josep", Kees Cook, Tobin Feldman-Fitzthum,
 Tom Hromatka, Will Drewry
Content-Type: text/plain; charset="UTF-8"

On Thu, May 20, 2021 at 1:22 AM Tianyin Xu wrote:
>
> On Mon, May 17, 2021 at 12:08 PM Sargun Dhillon wrote:
> >
> > While I agree with you that this is the case right now, there's no reason it
> > has to be the case. There's a variety of mechanisms that can be employed
> > to significantly speed up the performance of the notifier. For example, right
> > now the notifier is behind one large per-filter lock. That could be removed
> > allowing for better concurrency. There are a large number of mechanisms
> > that scale O(n) with the outstanding notifications -- again, something
> > that could be improved.
>
> Thanks for the pointer! But, I don’t think this can fundamentally
> eliminate the performance gap between the notifiers and the ebpf
> filters. IMHO, the additional context switches of user notifiers make
> the difference.
>
I mean, I still think it can be closed. Or at least get better. I've thought
about working on performance improvements, but they're lower on the list
than functionality changes.

> >
> > The other big improvement that could be made is being able to use something
> > like io_uring with the notifier interface, but it would require a fairly
> > significant user API change -- and a move away from ioctl. I'm not sure if
> > people are excited about that idea at the moment.
> >
>
> Apologize that I don’t fully understand your proposal.
> My understanding about io_uring is that it allows you to amortize the
> cost of context switch but not eliminate it, unless you are willing to
> dedicate a core for it. I still believe that, even with io_uring, user
> notifiers are going to be much slower than eBPF filters.

The notifier gets significantly slower as a function of the number of
notifications. If you have a large number of notifications in flight, or if
you're trying to concurrently handle a large number of notifications, it gets
slower. This is where something like io_uring is super useful in terms of
reducing wakeups. Also, the original futex2 patches had a mechanism to better
handle scheduling in notifier-like cases [1]. If the seccomp notifier did a
similar thing, we could see better performance.

>
> Btw, our patches are based on your patch set (thank you!). Are you
> using user notifiers (with your improved version?) these days? It will
> be nice to hear your opinions on ebpf filters.
>
I'm so glad that someone is picking up the work on this.

> >
> > > >
> > > > >> eBPF doesn't really have a privilege model yet. There was a long and
> > > > >> disappointing thread about this awhile back.
> > > > >
> > > > > The idea is that “seccomp-eBPF does not make life easier for an
> > > > > adversary”. Any attack an adversary could potentially utilize
> > > > > seccomp-eBPF, they can do the same with other eBPF features, i.e. it
> > > > > would be an issue with eBPF in general rather than specifically
> > > > > seccomp’s use of eBPF.
> > > > >
> > > > > Here it is referring to the helpers goes to the base
> > > > > bpf_base_func_proto if the caller is unprivileged (!bpf_capable ||
> > > > > !perfmon_capable). In this case, if the adversary would utilize eBPF
> > > > > helpers to perform an attack, they could do it via another
> > > > > unprivileged prog type.
> > > > >
> > > > > That said, there are a few additional helpers this patchset is adding:
> > > > > * get_current_uid_gid
> > > > > * get_current_pid_tgid
> > > > >   These two provide public information (are namespaces a concern?). I
> > > > > have no idea what kind of exploit it could add unless the adversary
> > > > > somehow side-channels the task_struct? But in that case, how is the
> > > > > reading of task_struct different from how the rest of the kernel is
> > > > > reading task_struct?
> > > >
> > > > Yes, namespaces are a concern. This idea got mostly shot down for kdbus
> > > > (what ever happened to that?), and it likely has the same problems for
> > > > seccomp.
> > > >
So, we actually have a case where we want to inspect an argument: we want to
look at the FD number that's passed to the sendmsg syscall and see whether it
refers to an AF_INET socket. If it does, we pass the call back to the notifier;
otherwise, we allow it to continue through. This is an area where I can see
eBPF being very useful.

> > > > >>
> > > > >> What is this for?
> > > > >
> > > > > Memory reading opens up lots of use cases. For example, logging what
> > > > > files are being opened without imposing too much performance penalty
> > > > > from strace. Or as an accelerator for user notify emulation, where
> > > > > syscalls can be rejected on a fast path if we know the memory contents
> > > > > does not satisfy certain conditions that user notify will check.
> > > > >
> > > >
> > > > This has all kinds of race conditions.
> > > >
> > > >
> > > > I hate to be a party pooper, but this patchset is going to very high bar
> > > > to acceptance. Right now, seccomp has a couple of excellent properties:
> > > >
> > > > First, while it has limited expressiveness, it is simple enough that the
> > > > implementation can be easily understood and the scope for
> > > > vulnerabilities that fall through the cracks of the seccomp sandbox
> > > > model is low.
> > > > Compare this to Windows' low-integrity/high-integrity
> > > > sandbox system: there is a never ending string of sandbox escapes due to
> > > > token misuse, unexpected things at various integrity levels, etc.
> > > > Seccomp doesn't have tokens or integrity levels, and these bugs don't
> > > > happen.
> > > >
> > > > Second, seccomp works, almost unchanged, in a completely unprivileged
> > > > context. The last time making eBPF work sensibly in a less- or
> > > > -unprivileged context, the maintainers mostly rejected the idea of
> > > > developing/debugging a permission model for maps, cleaning up the bpf
> > > > object id system, etc. You are going to have a very hard time
> > > > convincing the seccomp maintainers to let any of these mechanism
> > > > interact with seccomp until the underlying permission model is in place.
> > > >
> > > > --Andy
> > >
> > > Thanks for pointing out the tradeoff between expressiveness vs. simplicity.
> > >
> > > Note that we are _not_ proposing to replace cbpf, but propose to also
> > > support ebpf filters. There certainly are use cases where cbpf is
> > > sufficient, but there are also important use cases ebpf could make
> > > life much easier.
> > >
> > > Most importantly, we strongly believe that ebpf filters can be
> > > supported without reducing security.
> > >
> > > No worries about “party pooping” and we appreciate the feedback. We’d
> > > love to hear concerns and collect feedback so we can address them to
> > > hit that very high bar.
> > >
> > >
> > > ~t
> > >
> > > --
> > > Tianyin Xu
> > > University of Illinois at Urbana-Champaign
> > > https://urldefense.com/v3/__https://tianyin.github.io/__;!!DZ3fjg!o4__Ob32oapUDg9_f6hzksoFiX9517CJ5-w8qtG9i-WKFs_xWbGQfUHpLjHjCddw$
>

[1]: https://lore.kernel.org/lkml/20210215152404.250281-1-andrealmeid@collabora.com/T/
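
As a rough illustration of the sendmsg case described earlier in this message
(look at the FD, check whether it refers to an AF_INET socket, hand the call to
the notifier if so, otherwise let it through), here is a minimal userspace
sketch in Python. It is not code from the patch set and it is not an eBPF
filter. The names fd_is_af_inet() and decide() are hypothetical, and the sketch
assumes the descriptor is already open in the process doing the check; a real
supervisor would first have to obtain its own copy of the target's descriptor.

```python
import os
import socket
import stat


def fd_is_af_inet(fd: int) -> bool:
    """Return True if fd refers to an AF_INET socket, False otherwise."""
    # Files, pipes and other non-socket descriptors can never be AF_INET.
    if not stat.S_ISSOCK(os.fstat(fd).st_mode):
        return False
    # Wrap a duplicate so that closing the wrapper leaves the caller's fd open.
    # When fileno= is given, Python 3.7+ detects the address family from the
    # descriptor itself.
    with socket.socket(fileno=os.dup(fd)) as sock:
        return sock.family == socket.AF_INET


def decide(fd: int) -> str:
    """Hypothetical policy: defer AF_INET sends to the notifier, allow the rest."""
    return "notify" if fd_is_af_inet(fd) else "allow"
```

With this sketch, decide() returns "notify" for a descriptor created with
socket.socket(socket.AF_INET, socket.SOCK_DGRAM) and "allow" for an AF_UNIX
socket or a pipe. Checking the family in the filter itself, as discussed above,
would spare the common "allow" path the round trip to userspace.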