From: Andy Lutomirski <email@example.com>
To: Jann Horn <firstname.lastname@example.org>
Cc: Kees Cook <email@example.com>,
kernel list <firstname.lastname@example.org>,
Christian Brauner <email@example.com>,
Sargun Dhillon <firstname.lastname@example.org>,
Tycho Andersen <email@example.com>,
"zhujianwei (C)" <firstname.lastname@example.org>,
Dave Hansen <email@example.com>,
Matthew Wilcox <firstname.lastname@example.org>,
Andy Lutomirski <email@example.com>, Will Drewry <firstname.lastname@example.org>,
Shuah Khan <email@example.com>, Matt Denton <firstname.lastname@example.org>,
Chris Palmer <email@example.com>,
Jeffrey Vander Stoep <firstname.lastname@example.org>,
Aleksa Sarai <email@example.com>,
the arch/x86 maintainers <firstname.lastname@example.org>,
Linux Containers <email@example.com>,
Linux API <firstname.lastname@example.org>
Subject: Re: [PATCH 4/8] seccomp: Implement constant action bitmaps
Date: Tue, 16 Jun 2020 14:13:14 -0700 [thread overview]
Message-ID: <CE2F8139-4549-46CC-ABDB-3429361847A2@amacapital.net> (raw)
> On Jun 16, 2020, at 11:36 AM, Jann Horn <email@example.com> wrote:
> On Tue, Jun 16, 2020 at 5:49 PM Kees Cook <firstname.lastname@example.org> wrote:
>>> On Tue, Jun 16, 2020 at 02:14:47PM +0200, Jann Horn wrote:
>>> Wouldn't it be simpler to use a function that can run a subset of
>>> seccomp cBPF and bails out on anything that indicates that a syscall's
>>> handling is complex or on instructions it doesn't understand? For
>>> syscalls that have a fixed policy, a typical seccomp filter doesn't
>>> even use any of the BPF_ALU ops, the scratch space, or the X register;
>>> it just uses something like the following set of operations, which is
>>> easy to emulate without much code:
>>> BPF_LD | BPF_W | BPF_ABS
>>> BPF_JMP | BPF_JEQ | BPF_K
>>> BPF_JMP | BPF_JGE | BPF_K
>>> BPF_JMP | BPF_JGT | BPF_K
>>> BPF_JMP | BPF_JA
>>> BPF_RET | BPF_K
>> Initially, I started down this path. It needed a bit of plumbing into
>> BPF to better control the lifetime of the cBPF "saved original filter"
>> (normally used by CHECKPOINT_RESTORE uses)
> I don't think you need that? When a filter is added, you can compute
> the results of the added individual filter, and then merge the state.
>> and then I needed to keep
>> making exceptions (same list you have: ALU, X register, scratch, etc)
>> in the name of avoiding too much complexity in the emulator. I decided
>> I'd rather reuse the existing infrastructure to actually execute the
>> filter (no cBPF copy needed to be saved, no separate code, and full
>> instruction coverage).
> If you really think that this bit of emulation is so bad, you could
> also make a copy of the BPF filter in which you replace all load
> instructions from syscall arguments with "return NON_CONSTANT_RESULT",
> and then run that through the normal BPF infrastructure.
>>> Something like (completely untested):
>> I didn't actually finish going down the emulator path (I stopped right
>> around the time I verified that libseccomp does use BPF_ALU -- though
>> only BPF_AND), so I didn't actually evaluate the filter contents for other
>> filter builders (i.e. Chrome).
>> But, if BPF_ALU | BPF_AND were added to your code above, it would cover
>> everything libseccomp generates (which covers a lot of the seccomp
>> filters, e.g. systemd, docker). I just felt funny about an "incomplete"
>> Though now you've got me looking. It seems this is the core
>> of Chrome's BPF instruction generation:
>> It also uses ALU|AND, but adds JMP|JSET.
>> So... that's only 2 more instructions to cover what I think are likely
>> the two largest seccomp instruction generators.
>>> That way, you won't need any of this complicated architecture-specific stuff.
>> There are two arch-specific needs, and using a cBPF-subset emulator
>> just gets rid of the local TLB flush. The other part is distinguishing
>> the archs. Neither requirement is onerous (TLB flush usually just
>> needs little more than an extern, arch is already documented in the
>> per-arch syscall_get_arch()).
> But it's also somewhat layer-breaking and reliant on very specific
> assumptions. Normal kernel code doesn't mess around with page table
> magic, outside of very specific low-level things. And your method
> would break if the fixed-value members were not all packed together at
> the start of the structure.
> And from a hardening perspective: The more code we add that fiddles
> around with PTEs directly, rather than going through higher-level
> abstractions, the higher the chance that something gets horribly
> screwed up. For example, this bit from your patch looks *really*
> + preempt_disable();
> + set_pte_at(&init_mm, vaddr, ptep,
> + local_flush_tlb_kernel_range(vaddr, vaddr + PAGE_SIZE);
> + preempt_enable();
> First off, that set_pte_at() is just a memory write; I don't see why
> you put it inside a preempt_disable() region.
> But more importantly, sticking a local TLB flush inside a
> preempt_disable() region with nothing else in there looks really
> shady. How is that supposed to work? If we migrate from CPU0 to CPU1
> directly before this region, and then from CPU1 back to CPU0 directly
> afterwards, the local TLB flush will have no effect.
With my x86/mm maintainer hat on, this is highly questionable. Either the real API should be used, or there should be a sane API. The former will have really atrocious performance, and the latter would need some thought. Basically, if you pin entire process to one CPU, you can clear the dirty bit, flush, do some magic, and read it back. This is only valid if you have a short enough operation that running with preemption off is reasonable. Otherwise you need to arrange to flush when you schedule in, which could be done with a voluntary preemption style or with scheduler hooks.
I’m not convinced this is worthwhile.
next prev parent reply other threads:[~2020-06-16 21:13 UTC|newest]
Thread overview: 24+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-06-16 7:49 [RFC][PATCH 0/8] seccomp: Implement constant action bitmaps Kees Cook
2020-06-16 7:49 ` [PATCH 1/8] selftests/seccomp: Improve calibration loop Kees Cook
2020-06-16 7:49 ` [PATCH 2/8] seccomp: Use pr_fmt Kees Cook
2020-06-16 7:49 ` [PATCH 3/8] seccomp: Introduce SECCOMP_PIN_ARCHITECTURE Kees Cook
2020-06-16 16:56 ` Andy Lutomirski
2020-06-17 15:25 ` Jann Horn
2020-06-17 15:29 ` Andy Lutomirski
2020-06-17 15:31 ` Jann Horn
2020-06-16 7:49 ` [PATCH 4/8] seccomp: Implement constant action bitmaps Kees Cook
2020-06-16 12:14 ` Jann Horn
2020-06-16 15:48 ` Kees Cook
2020-06-16 18:36 ` Jann Horn
2020-06-16 18:49 ` Kees Cook
2020-06-16 21:13 ` Andy Lutomirski [this message]
2020-06-16 14:40 ` Dave Hansen
2020-06-16 16:01 ` Kees Cook
2020-06-16 7:49 ` [PATCH 5/8] selftests/seccomp: Compare bitmap vs filter overhead Kees Cook
2020-06-16 7:49 ` [PATCH 6/8] x86: Provide API for local kernel TLB flushing Kees Cook
2020-06-16 16:59 ` Andy Lutomirski
2020-06-16 18:37 ` Kees Cook
2020-06-16 7:49 ` [PATCH 7/8] x86: Enable seccomp constant action bitmaps Kees Cook
2020-06-16 7:49 ` [PATCH 8/8] [DEBUG] seccomp: Report bitmap coverage ranges Kees Cook
2020-06-16 17:01 ` [RFC][PATCH 0/8] seccomp: Implement constant action bitmaps Andy Lutomirski
2020-06-16 18:35 ` Kees Cook
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).