Re: [PATCH v1 0/6] seccomp: Implement constant action bitmaps

From: Rasmus Villemoes <linux@rasmusvillemoes.dk>
To: YiFei Zhu <zhuyifei1999@gmail.com>,
	Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Kees Cook <keescook@chromium.org>,
	YiFei Zhu <yifeifz2@illinois.edu>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Giuseppe Scrivano <gscrivan@redhat.com>,
	Will Drewry <wad@chromium.org>, bpf <bpf@vger.kernel.org>,
	Jann Horn <jannh@google.com>,
	Linux API <linux-api@vger.kernel.org>,
	Linux Containers <containers@lists.linux-foundation.org>,
	Tobin Feldman-Fitzthum <tobin@ibm.com>,
	Hubertus Franke <frankeh@us.ibm.com>,
	Andy Lutomirski <luto@amacapital.net>,
	Valentin Rothberg <vrothber@redhat.com>,
	Dimitrios Skarlatos <dskarlat@cs.cmu.edu>,
	Jack Chen <jianyan2@illinois.edu>,
	Josep Torrellas <torrella@illinois.edu>,
	Tianyin Xu <tyxu@illinois.edu>,
	kernel list <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v1 0/6] seccomp: Implement constant action bitmaps
Date: Fri, 25 Sep 2020 07:56:46 +0200
Message-ID: <27b4ef86-fee5-fc35-993b-3352ce504c73@rasmusvillemoes.dk>
In-Reply-To: <CABqSeAQKksqM1SdsQMoR52AJ5CY0VE2tk8-TJaMuOrkCprQ0MQ@mail.gmail.com>

On 24/09/2020 15.58, YiFei Zhu wrote:
> On Thu, Sep 24, 2020 at 8:46 AM Rasmus Villemoes
> <linux@rasmusvillemoes.dk> wrote:
>> But one thing I'm wondering about and I haven't seen addressed anywhere:
>> Why build the bitmap on the kernel side (with all the complexity of
>> having to emulate the filter for all syscalls)? Why can't userspace just
>> hand the kernel "here's a new filter: the syscalls in this bitmap are
>> always allowed noquestionsasked, for the rest, run this bpf". Sure, that
>> might require a new syscall or extending seccomp(2) somewhat, but isn't
>> that a _lot_ simpler? It would probably also mean that the bpf we do get
>> handed is a lot smaller. Userspace might need to pass a couple of
>> bitmaps, one for each relevant arch, but you get the overall idea.
> 
> Perhaps. The thing is, the current API expects any filter attaches to
> be "additive". If a new filter gets attached that says "disallow read"
> then no matter whatever has been attached already, "read" shall not be
> allowed at the next syscall, bypassing all previous allowlist bitmaps
> (so you need to emulate the bpf anyways here?). We should also not
> have an API that could let anyone escape the seccomp jail. Say "prctl"
> is permitted but "read" is not permitted, one must not be allowed to
> attach a bitmap so that "read" now appears in the allowlist. The only
> way this could potentially work is to attach a BPF filter and a bitmap
> at the same time in the same syscall, which might mean API redesign?

Yes, the man page would read something like

       SECCOMP_SET_MODE_FILTER_BITMAP
              The system calls allowed are defined by a pointer to a
              Berkeley Packet Filter (BPF) passed via args. This
              argument is a pointer to a struct sock_fprog_bitmap;

with that struct containing whatever information/extra pointers needed
for passing the bitmap(s) in addition to the bpf prog.

And SECCOMP_SET_MODE_FILTER would internally just be updated to work
as if all-zero allow-bitmaps were passed along. The internal kernel
bitmap would just be the bitwise AND of the bitmaps in the filter stack.
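To make the AND rule concrete, here is a rough userspace model of it. This
is only a sketch of the idea, not kernel code: struct filter_model,
NR_SYSCALLS_MODEL, allow_set(), effective_allow() and fast_allowed() are
all made-up names, and the real UAPI struct might look nothing like this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical size of the modelled syscall table; not a kernel constant. */
#define NR_SYSCALLS_MODEL 512
#define BITMAP_WORDS (NR_SYSCALLS_MODEL / 64)

/* Hypothetical model of one attached filter: an allow-bitmap plus a link
 * to the previously attached filter. The bpf program itself is left out. */
struct filter_model {
	uint64_t allow[BITMAP_WORDS];     /* bit set => allowed without running bpf */
	const struct filter_model *prev;  /* older filter in the stack, or NULL */
};

/* Mark syscall number nr as always allowed by filter f. */
static void allow_set(struct filter_model *f, unsigned int nr)
{
	if (nr < NR_SYSCALLS_MODEL)
		f->allow[nr / 64] |= (uint64_t)1 << (nr % 64);
}

/* Effective bitmap = AND of every bitmap in the stack. An empty stack
 * (top == NULL) leaves all bits set, i.e. no filter restricts anything. */
static void effective_allow(const struct filter_model *top,
			    uint64_t out[BITMAP_WORDS])
{
	memset(out, 0xff, BITMAP_WORDS * sizeof(out[0]));
	for (const struct filter_model *f = top; f; f = f->prev)
		for (size_t i = 0; i < BITMAP_WORDS; i++)
			out[i] &= f->allow[i];
}

/* 1 if nr can skip bpf entirely, 0 if the filters must be run. */
static int fast_allowed(const uint64_t bm[BITMAP_WORDS], unsigned int nr)
{
	if (nr >= NR_SYSCALLS_MODEL)
		return 0;
	return (int)((bm[nr / 64] >> (nr % 64)) & 1);
}
```

Since a later filter can only clear bits, attaching a bitmap can never add a
syscall to the fast path that an earlier filter did not already allow, and a
plain SECCOMP_SET_MODE_FILTER attach (all-zero bitmap) sends every syscall
back through bpf.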

Sure, it's UAPI, so it would certainly need more careful thought on the
details of just what the arg struct looks like etc. etc., but I was
wondering why it hadn't been discussed at all.

>> I'm also a bit worried about the performance of doing that emulation;
>> that's constant extra overhead for, say, launching a docker container.
> 
> IMO, launching a docker container is so expensive this should be negligible.

Regardless, I'd like to see some numbers, certainly for "how much
faster does a getpid() or read() or any of the other syscalls that
nobody disallows get", but also for "what's the cost of doing that
emulation at seccomp(2) time".
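
For the first kind of number, something as crude as the following would
probably do as a starting point. It is a minimal sketch, assuming Linux and
a helper name (ns_per_syscall()) invented here; it times a raw syscall in a
loop, to be run once without a filter and once under the filter of interest.
It goes through syscall(2) rather than the libc getpid() wrapper so that any
caching in libc does not hide the kernel entry cost.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: average wall-clock nanoseconds per invocation of an
 * argument-less syscall nr, measured over iters back-to-back calls. */
static double ns_per_syscall(long nr, unsigned long iters)
{
	struct timespec start, end;
	double elapsed_ns;

	if (iters == 0)
		return 0.0;

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (unsigned long i = 0; i < iters; i++)
		syscall(nr);
	clock_gettime(CLOCK_MONOTONIC, &end);

	elapsed_ns = (double)(end.tv_sec - start.tv_sec) * 1e9 +
		     (double)(end.tv_nsec - start.tv_nsec);
	return elapsed_ns / (double)iters;
}
```

Calling e.g. ns_per_syscall(SYS_getpid, 10000000) before and after the
seccomp(2) call gives the per-syscall difference; timing the seccomp(2) call
itself the same way would cover the attach-time emulation cost.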

Rasmus