bpf.vger.kernel.org archive mirror
From: Lennart Poettering <lennart@poettering.net>
To: Kees Cook <keescook@chromium.org>
Cc: "Alexei Starovoitov" <alexei.starovoitov@gmail.com>,
	"zhujianwei (C)" <zhujianwei7@huawei.com>,
	"bpf@vger.kernel.org" <bpf@vger.kernel.org>,
	Hehuazhen <hehuazhen@huawei.com>,
	"Christian Ehrhardt" <christian.ehrhardt@canonical.com>,
	"Zbigniew Jędrzejewski-Szmek" <zbyszek@in.waw.pl>
Subject: Re: new seccomp mode aims to improve performance
Date: Mon, 1 Jun 2020 12:11:37 +0200
Message-ID: <20200601101137.GA121847@gardel-login>
In-Reply-To: <202005291043.A63D910A8@keescook>

On Fr, 29.05.20 12:27, Kees Cook (keescook@chromium.org) wrote:

> # grep ^Seccomp_filters /proc/$(pidof systemd-resolved)/status
> Seccomp_filters:        32
> # grep SystemCall /lib/systemd/system/systemd-resolved.service
> SystemCallArchitectures=native
> SystemCallErrorNumber=EPERM
> SystemCallFilter=@system-service
> I'd like to better understand what they're doing, but haven't had time
> to dig in. (The systemd devel mailing list requires subscription, so
> I've directly CCed some systemd folks that have touched seccomp there
> recently. Hi! The start of this thread is here[4].)

Hmm, so on x86-64 we try to install our seccomp filters three times:
for the x86-64 syscall ABI, for the i386 syscall ABI and for the x32
syscall ABI. Not all of the filters we apply work on all ABIs though,
because some syscalls are available on some ABIs but not on others, or
cannot sensibly be matched on some (because of socketcall, ipc and
similar multiplexed syscalls).

When we first added support for seccomp filters to systemd we compiled
everything into a single filter, and let libseccomp apply it to
different archs. But that didn't work out, since libseccomp doesn't
tell us when it manages to apply a filter and when not, i.e. on which
arch it worked and on which arch it didn't. And since we have some
whitelist and some blacklist filters the internal fallback logic of
libseccomp doesn't work for us either, since you never know what you
end up with. So we ended up breaking the different settings up into
individual filters, and applying them individually and separately for
each arch, so that we know exactly what we managed to install and what
not, what will actually filter properly, and what we can check in our
test suite.
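
The per-arch bookkeeping described above can be sketched roughly as
follows. This is a small Python model, not systemd's actual C code:
`try_install` is a hypothetical stand-in for building and loading one
libseccomp filter, and the rule that the address family filter cannot
be installed on i386 is an assumption picked only to illustrate the
socketcall case.

```python
from typing import Callable, Dict, Iterable, Tuple

ARCHS = ("x86-64", "i386", "x32")  # the three ABIs named in the message


def install_per_arch(
    settings: Iterable[str],
    try_install: Callable[[str, str], bool],
    archs: Iterable[str] = ARCHS,
) -> Dict[Tuple[str, str], bool]:
    """Attempt one filter per (setting, arch) pair and record each outcome."""
    archs = tuple(archs)
    results: Dict[Tuple[str, str], bool] = {}
    for setting in settings:
        for arch in archs:
            results[(setting, arch)] = try_install(setting, arch)
    return results


def fake_try_install(setting: str, arch: str) -> bool:
    # Hypothetical rule (assumption): socket() goes through the multiplexed
    # socketcall on i386, so this filter cannot sensibly be installed there.
    return not (setting == "RestrictAddressFamilies" and arch == "i386")


results = install_per_arch(
    ["LockPersonality", "RestrictAddressFamilies"], fake_try_install
)
for (setting, arch), ok in sorted(results.items()):
    print(f"{setting:24} {arch:7} {'installed' if ok else 'skipped'}")
```

Because every (setting, arch) pair has its own recorded outcome, a log
line or a test case can refer to exactly one filter instead of one
opaque combined result.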

Keeping the filters separate made things a lot easier and simpler to
debug, and our log output and testing became much less of a black
box. We know exactly what worked and what didn't, and our tests
validate each filter.

For systemd-resolved we apply a bunch more filters than just those
that are the result of SystemCallFilter= and SystemCallArchitectures=
(SystemCallFilter= itself synthesizes one filter per syscall ABI).

1. RestrictSUIDSGID= generates a seccomp filter to prevent the creation
   of suid/sgid binaries, i.e. it filters chmod() and related calls and
   their arguments

2. LockPersonality= blocks personality() for most arguments

3. MemoryDenyWriteExecute= blocks mmap() and similar calls if the
   selected map has X and W set at the same time

4. RestrictRealtime= blocks sched_setscheduler() for most parameters

5. RestrictAddressFamilies= blocks socket() and related calls for
   various address families

6. ProtectKernelLogs= blocks the syslog() syscall for most parameters

7. ProtectKernelTunables= blocks the old _sysctl() syscall among some
   other things

8. RestrictNamespaces= blocks various unshare() and clone() bits
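
Several of the filters in this list match on syscall arguments rather
than on the syscall number alone. As a minimal sketch of the kind of
bitmask test involved (items 1 and 3), written in Python for
illustration only: the function names are hypothetical, and the real
checks are expressed as seccomp argument comparisons, not as Python
code.

```python
import mmap
import stat


def is_write_exec_mapping(prot: int) -> bool:
    """True if a mapping requests write and execute permission together."""
    wx = mmap.PROT_WRITE | mmap.PROT_EXEC
    return (prot & wx) == wx


def sets_suid_or_sgid(mode: int) -> bool:
    """True if a chmod()-style mode argument carries the suid or sgid bit."""
    return bool(mode & (stat.S_ISUID | stat.S_ISGID))


# A W+X mapping would be refused, a plain read/write mapping would pass.
print(is_write_exec_mapping(mmap.PROT_READ | mmap.PROT_WRITE))
print(is_write_exec_mapping(mmap.PROT_WRITE | mmap.PROT_EXEC))

# A mode with the suid bit would be refused, an ordinary 0755 would pass.
print(sets_suid_or_sgid(0o4755))
print(sets_suid_or_sgid(0o0755))
```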

So yeah, if one turns on many of these options in services (and we
generally turn on everything we can for the services we ship) and then
multiplies that by the number of archs, one ends up with quite a bunch
of filters.
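
To make the multiplication concrete, here is a back-of-the-envelope
sketch in Python. It assumes, purely for illustration, that each
setting named in this message yields exactly one filter per ABI; as
noted above, not every filter applies on every ABI, so this is an
upper bound for these settings only, and it does not attempt to
reproduce the figure of 32 from the quoted output.

```python
from itertools import product

ABIS = ("x86-64", "i386", "x32")
SETTINGS = (
    "SystemCallFilter",
    "RestrictSUIDSGID",
    "LockPersonality",
    "MemoryDenyWriteExecute",
    "RestrictRealtime",
    "RestrictAddressFamilies",
    "ProtectKernelLogs",
    "ProtectKernelTunables",
    "RestrictNamespaces",
)


def max_filter_count(settings, abis) -> int:
    """Upper bound under the stated assumption: one filter per (setting, ABI)."""
    return len(list(product(settings, abis)))


print(max_filter_count(SETTINGS, ABIS))  # 9 settings x 3 ABIs = 27
```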

If we wanted to optimize that in userspace, then libseccomp would have
to be improved quite substantially to let us know exactly what works
and what doesn't, and to have sane fallback both when building
whitelists and blacklists.

An easy improvement would probably be for libseccomp to refuse to
install x32 seccomp filters altogether, now that x32 is entirely
dead. Or are the entrypoints for x32 syscalls still available in the
kernel? How could userspace figure out if they are available? If
libseccomp doesn't want to add code for that, we probably could have
that in systemd itself too...


Lennart Poettering, Berlin
