Re: [Ksummit-discuss] [TECH TOPIC] seccomp feature development

From: Kees Cook <keescook@chromium.org>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: bpf@vger.kernel.org, ksummit <ksummit-discuss@lists.linuxfoundation.org>
Subject: Re: [Ksummit-discuss] [TECH TOPIC] seccomp feature development
Date: Wed, 20 May 2020 16:39:20 -0700
Message-ID: <202005201540.EF1BD18B44@keescook>
In-Reply-To: <20200520221256.tzqkjpeswv3d6ne2@ast-mbp.dhcp.thefacebook.com>

On Wed, May 20, 2020 at 03:12:56PM -0700, Alexei Starovoitov wrote:
> On Wed, May 20, 2020 at 12:04:04PM -0700, Kees Cook wrote:
> > On Wed, May 20, 2020 at 11:27:03AM -0700, Linus Torvalds wrote:
> > > Don't make this some kind of abstract conceptual problem thing.
> > > Because it's not.
> > 
> > I have no intention of making this abstract (the requests for expanding
> > seccomp coverage have been for only a select class of syscalls, and
> > specifically clone3 and openat2) nor more complicated than it needs to be
> > (I regularly resist expanding the seccomp BPF dialect into eBPF).
> 
> Kees, since you've forked the thread I'm adding bpf mailing list back and
> re-iterating my point:
> ** Nack to cBPF extensions **

Yes, I know. I agreed[1] with you on this point.

> How is that relevant?
> You're proposing to add copy_from_user() to selected syscalls, like clone3,
> and present large __u32 array to cBPF program.
> In other words existing fixed sized 'struct seccomp_data' will become
> either variable length or jumbo fixed size like one page.
> In the former case it would mean that cBPF would need to be extended
> with variable length logic. Which in turn means it will suffer from
> spectre v1 issues.

I don't expect to need to do anything with variable lengths in the
seccomp BPF dialect. As I said in the other thread, if we are faced with
design trade-offs that require extending the seccomp filter language, we
would switch to eBPF.
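
For reference, the fixed-size input that a seccomp filter sees today can
be sketched as below. This is a portable mirror of struct seccomp_data
from include/uapi/linux/seccomp.h using <stdint.h> types; the
"sketch_" name is hypothetical and only there so the block compiles
without kernel headers.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical userspace mirror of struct seccomp_data. The real
 * definition uses int, __u32 and __u64; the layout is the same.
 */
struct sketch_seccomp_data {
	int nr;                        /* syscall number */
	uint32_t arch;                 /* AUDIT_ARCH_* value */
	uint64_t instruction_pointer;  /* IP at the time of the syscall */
	uint64_t args[6];              /* raw syscall arguments, pointers not followed */
};

/* The filter input has one fixed size: no variable-length handling. */
_Static_assert(sizeof(struct sketch_seccomp_data) == 64,
	       "filter input is a fixed 64 bytes");
```

Because every field sits at a constant offset, a cBPF filter can only do
bounded loads from this buffer; pointer arguments appear as plain 64-bit
values in args[] and are never dereferenced.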

> If you go with latter approach of presenting cBPF with giant
> 'struct seccomp_data + page' that extra page would need to be zeroed out
> before invocation of bpf program which will make seccomp even less usable
> than it is today. Currently it's slow and unusable in production datacenter.

Making universal declarations based on your opinion does not help
convince people of your position. Saying it's "unusable in production
datacenter" is perhaps true for you, but hardly true for the many
datacenters that do use it.

Additionally, we're obviously not interested in making seccomp _slower_.
The entire point of an investigation of the design is to examine our
options and find the right solution.

> People suggested for years to adopt eBPF in seccomp to accelerate it,
> but, as you confessed, you resisted and sounds like now you want to
> implement seccomp specific syscall bitmask?

Yes -- because it's an order of magnitude faster than even a
single-instruction BPF seccomp filter. The vast majority of seccomp filters need
nothing more than a single yes/no, and right now the bulk of processing
time is spent running the BPF filter. I would prefer to avoid BPF
entirely where possible for seccomp.
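
To make the bitmask idea concrete, here is a minimal userspace sketch of
a per-syscall allow bitmap. It is an illustration under stated
assumptions, not kernel code and not a proposed implementation: every
"sketch_" name and the 512-entry bound are hypothetical.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_NR_SYSCALLS 512	/* hypothetical upper bound on syscall numbers */
#define SKETCH_WORD_BITS   (sizeof(unsigned long) * CHAR_BIT)
#define SKETCH_WORDS \
	((SKETCH_NR_SYSCALLS + SKETCH_WORD_BITS - 1) / SKETCH_WORD_BITS)

/* One bit per syscall number: set means "always allowed". */
struct sketch_allow_bitmap {
	unsigned long bits[SKETCH_WORDS];
};

/* Mark a syscall number as unconditionally allowed. */
static void sketch_allow(struct sketch_allow_bitmap *bm, unsigned int nr)
{
	if (nr < SKETCH_NR_SYSCALLS)
		bm->bits[nr / SKETCH_WORD_BITS] |= 1UL << (nr % SKETCH_WORD_BITS);
}

/*
 * Fast path: a bounds check plus a single bit test. A clear bit (or an
 * out-of-range number) means the caller falls back to the full filter.
 */
static bool sketch_fast_allow(const struct sketch_allow_bitmap *bm,
			      unsigned int nr)
{
	if (nr >= SKETCH_NR_SYSCALLS)
		return false;
	return (bm->bits[nr / SKETCH_WORD_BITS] >> (nr % SKETCH_WORD_BITS)) & 1UL;
}
```

For a filter whose answer depends only on the syscall number, the bit
test settles the common case without running any BPF at all; only
syscalls whose bit is clear would still go through the filter.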

> Which means more kernel code, more bugs, more security issues.

Right. This is a solid design principle, and one I agree with: avoid
adding code, keep things simple, everything will have bugs. And, as it
stands, seccomp has had a significantly safer history than eBPF, largely
due to its goal of staying as utterly small and simple as possible. I
don't intend to discard that stance, and it's why I would rather continue
to shield seccomp from the regularly occurring eBPF flaws.

> imo that's another reinvented wheel when eBPF can do it already. I don't think
> it's a good idea to add kernel code when an eBPF-based solution exists and is
> capable of examining any level of nested args.

Thanks to the neighboring thread here, the requirements no longer[2]
include nested args. Also, you're conflating bitmasks (to accelerate the
overwhelmingly common case) with deep argument inspection (which is
a rare but needed case).

> > Perhaps the question is "how deeply does seccomp need to inspect?"
> > and maybe it does not get to see anything beyond just the "top level"
> > struct (i.e. struct clone_args) and all pointers within THAT become
> > opaque? That certainly simplifies the design.
> 
> clone3's 'struct clone_args' has set_tid pointer as a second level.
> I don't think that sticking to first level of pointers for this particular
> syscall will make seccomp filtering any more practical.

Yup, we all agree. :)

-Kees

[1] https://lore.kernel.org/lkml/202005191434.57253AD@keescook/
[2] https://lore.kernel.org/ksummit-discuss/20200520221256.tzqkjpeswv3d6ne2@ast-mbp.dhcp.thefacebook.com/T/#m01a045c8715cfff8399ba86171039110befecbcf

-- 
Kees Cook