Re: seccomp feature development

From: Kees Cook <keescook@chromium.org>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>, Jann Horn <jannh@google.com>,
	Christian Brauner <christian.brauner@ubuntu.com>,
	Tycho Andersen <tycho@tycho.ws>,
	Sargun Dhillon <sargun@sargun.me>,
	Matt Denton <mpdenton@google.com>,
	Chris Palmer <palmer@google.com>,
	Jeffrey Vander Stoep <jeffv@google.com>,
	Linux Containers <containers@lists.linux-foundation.org>,
	Linux API <linux-api@vger.kernel.org>,
	kernel list <linux-kernel@vger.kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>, bpf <bpf@vger.kernel.org>
Subject: Re: seccomp feature development
Date: Tue, 19 May 2020 14:40:00 -0700	[thread overview]
Message-ID: <202005191434.57253AD@keescook> (raw)
In-Reply-To: <CAADnVQKRCCHRQrNy=V7ue38skb8nKCczScpph2WFv7U_jsS3KQ@mail.gmail.com>

On Tue, May 19, 2020 at 09:18:47AM -0700, Alexei Starovoitov wrote:
> On Mon, May 18, 2020 at 7:53 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
> >
> > On 2020-05-19, Jann Horn <jannh@google.com> wrote:
> > > On Mon, May 18, 2020 at 11:05 PM Kees Cook <keescook@chromium.org> wrote:
> > > > ## deep argument inspection
> > > >
> > > > The argument caching bit is, I think, rather mechanical in nature since
> > > > it's all "just" internal to the kernel: seccomp can likely adjust how it
> > > > allocates seccomp_data (maybe going so far as to have it split across two
> > > > pages with the syscall argument struct always starting on the 2nd page
> > > > boundary), and copying the EA struct into that page, which will be both
> > > > used by the filter and by the syscall.
> > >
> > > We could also do the same kind of thing the eBPF verifier does in
> > > convert_ctx_accesses(), and rewrite the context accesses to actually
> > > go through two different pointers depending on the (constant) offset
> > > into seccomp_data.
> >
> > My main worry with this is that we'll need to figure out what kind of
> > offset mathematics are necessary to deal with pointers inside the
> > extensible struct. As a very ugly proposal, you could make it so that
> > you multiply the offset by PAGE_SIZE each time you want to dereference
> > the pointer at that offset (unless we want to add new opcodes to cBPF to
> > allow us to represent this).
> 
> Please don't. cbpf is frozen.

https://www.youtube.com/watch?v=L0MK7qz13bU

If the only workable design paths for deep arg inspection end up needing
BPF helpers, I would agree that it's time for seccomp to grow eBPF
language support. I'm still hoping there's a clean solution that doesn't
require a seccomp language extension.

> > > We don't need to actually zero-fill memory for this beyond what the
> > > kernel supports - AFAIK the existing APIs already say that passing a
> > > short length is equivalent to passing zeroes, so we can just replace
> > > all out-of-bounds loads with zeroing registers in the filter.
> > > The tricky case is what should happen if the userspace program passes
> > > in fields that the filter doesn't know about. The filter can see the
> > > length field passed in by userspace, and then just reject anything
> > > where the length field is bigger than the structure size the filter
> > > knows about. But maybe it'd be slightly nicer if there was an
> > > operation for "tell me whether everything starting at offset X is
> > > zeroes", so that if someone compiles with newer kernel headers where
> > > the struct is bigger, and leaves the new fields zeroed, the syscalls
> > > still go through an outdated filter properly.
> >
> > I think the best way of handling this (without breaking programs
> > senselessly) is to have filters essentially emulate
> > copy_struct_from_user() semantics -- which is along the lines of what
> > you've suggested.
> 
> and cpbf load instruction will become copy_from_user() underneath?

No, this was meaning internal checking about struct sizes needs to exist
(not the user copy parts).

> I don't see how that can work.
> Have you considered implications to jits, register usage, etc ?
> 
> ebpf will become sleepable soon. It will be able to do copy_from_user()
> and examine any level of user pointer dereference.
> toctou is still going to be a concern though,
> but such ebpf+copy_from_user analysis and syscall sandboxing
> will not need to change kernel code base around syscalls at all.
> No need to invent E-syscalls and all the rest I've seen in this thread.

To avoid the ToCToU, the seccomp infrastructure must do the
copy_from_user(), so there's not need for the sleepable stuff in seccomp
that I can see. The question is mainly one of flattening.

-- 
Kees Cook