Linux-Security-Module Archive on lore.kernel.org
From: Kees Cook <keescook@chromium.org>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: "zhujianwei (C)" <zhujianwei7@huawei.com>,
	"bpf@vger.kernel.org" <bpf@vger.kernel.org>,
	"linux-security-module@vger.kernel.org"
	<linux-security-module@vger.kernel.org>,
	Hehuazhen <hehuazhen@huawei.com>,
	"Lennart Poettering" <lennart@poettering.net>,
	"Christian Ehrhardt" <christian.ehrhardt@canonical.com>,
	"Zbigniew Jędrzejewski-Szmek" <zbyszek@in.waw.pl>
Subject: Re: new seccomp mode aims to improve performance
Date: Fri, 29 May 2020 12:27:03 -0700
Message-ID: <202005291043.A63D910A8@keescook>
In-Reply-To: <202005290903.11E67AB0FD@keescook>

On Fri, May 29, 2020 at 09:09:28AM -0700, Kees Cook wrote:
> On Fri, May 29, 2020 at 08:43:56AM -0700, Alexei Starovoitov wrote:
> > I don't think your hunch at where cpu is spending cycles is correct.
> > Could you please do two experiments:
> > 1. try trivial seccomp bpf prog that simply returns 'allow'
> > 2. replace bpf_prog_run_pin_on_cpu() in seccomp.c with C code
> >   that returns 'allow' and make sure it's noinline or in a different C file,
> >   so that compiler doesn't optimize the whole seccomp_run_filters() into a nop.
> > 
> > Then measure performance of both.
> > I bet you'll see exactly the same numbers.
> 
> Android has already done this, it appeared to not be the same. Calling
> into a SECCOMP_RET_ALLOW filter had a surprisingly high cost. I'll see
> if I can get you the numbers. I was frankly quite surprised -- I
> understood the bulk of the seccomp overhead to be in taking the TIF_WORK
> path, copying arguments, etc, but something else is going on there.

So while it's not the Android measurements, here's what I'm seeing on
x86_64 (this is hardly a perfect noiseless benchmark, but sampling error
appears to be close to 1%):


net.core.bpf_jit_enable=0:

Benchmarking 16777216 samples...
10.633756139 - 0.004359714 = 10629396425
getpid native: 633 ns
23.008737499 - 10.633967641 = 12374769858
getpid RET_ALLOW 1 filter: 737 ns
36.723141843 - 23.008975696 = 13714166147
getpid RET_ALLOW 2 filters: 817 ns
47.751422021 - 36.723345630 = 11028076391
getpid BPF-less allow: 657 ns
Estimated total seccomp overhead for 1 filter: 104 ns
Estimated total seccomp overhead for 2 filters: 184 ns
Estimated seccomp per-filter overhead: 80 ns
Estimated seccomp entry overhead: 24 ns
Estimated BPF overhead per filter: 80 ns


net.core.bpf_jit_enable=1:
net.core.bpf_jit_harden=1:

Benchmarking 16777216 samples...
31.939978606 - 21.275190689 = 10664787917
getpid native: 635 ns
43.324592380 - 31.940794751 = 11383797629
getpid RET_ALLOW 1 filter: 678 ns
55.001650599 - 43.326293248 = 11675357351
getpid RET_ALLOW 2 filters: 695 ns
65.986452855 - 55.002249904 = 10984202951
getpid BPF-less allow: 654 ns
Estimated total seccomp overhead for 1 filter: 43 ns
Estimated total seccomp overhead for 2 filters: 60 ns
Estimated seccomp per-filter overhead: 17 ns
Estimated seccomp entry overhead: 26 ns
Estimated BPF overhead per filter: 24 ns


net.core.bpf_jit_enable=1:
net.core.bpf_jit_harden=0:

Benchmarking 16777216 samples...
10.684681435 - 0.004198682 = 10680482753
getpid native: 636 ns
22.050823167 - 10.685571417 = 11365251750
getpid RET_ALLOW 1 filter: 677 ns
33.714134291 - 22.051100183 = 11663034108
getpid RET_ALLOW 2 filters: 695 ns
44.793312551 - 33.714383001 = 11078929550
getpid BPF-less allow: 660 ns
Estimated total seccomp overhead for 1 filter: 41 ns
Estimated total seccomp overhead for 2 filters: 59 ns
Estimated seccomp per-filter overhead: 18 ns
Estimated seccomp entry overhead: 23 ns
Estimated BPF overhead per filter: 17 ns


The above is from my (very dangerous!) benchmarking patch[1].
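
For readers following the arithmetic: the "Estimated" lines in each run
are consistent with plain differences between the four measured means.
The sketch below reproduces them. The formulas are inferred from the
printed numbers, not copied from the benchmarking patch, and the function
and key names are made up for illustration.

```python
# Sketch: reproduce the "Estimated ..." lines from the measured means.
# The formulas are inferred from the printed output, not taken from the patch.

def ns_per_call(elapsed_ns, samples):
    """Mean cost of one call, truncated to whole nanoseconds."""
    return elapsed_ns // samples


def estimate_overheads(native, one_filter, two_filters, bpfless):
    """All arguments and results are mean nanoseconds per getpid() call."""
    total_1 = one_filter - native          # everything seccomp adds with 1 filter
    total_2 = two_filters - native         # everything seccomp adds with 2 filters
    per_filter = two_filters - one_filter  # marginal cost of one more filter
    entry = total_1 - per_filter           # what is left after removing one filter run
    bpf_per_filter = one_filter - bpfless  # 1 filter compared with the BPF-less allow
    return {
        "total_1": total_1,
        "total_2": total_2,
        "per_filter": per_filter,
        "entry": entry,
        "bpf_per_filter": bpf_per_filter,
    }


if __name__ == "__main__":
    # First run above (net.core.bpf_jit_enable=0).
    print(ns_per_call(10629396425, 16777216))
    print(estimate_overheads(633, 737, 817, 657))
```

Feeding in the means from the other two runs gives the estimates printed
for those runs as well.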

So, with the layered nature of seccomp filters there's a reasonable gain
to be seen for an O(1) bitmap lookup to skip running even a single filter,
even for the fastest BPF mode.
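
To make the bitmap idea concrete, here is a small user-space model of it.
This is only a sketch under stated assumptions, not kernel code and not a
proposed design: the FilterStack class, its attach() and check() methods,
and the way each filter declares its unconditionally allowed syscalls are
all hypothetical. A real implementation would have to derive that set
from the BPF program itself.

```python
# Toy model of "consult a per-syscall bitmap before running any filter".
# Action values match the seccomp UAPI; everything else is illustrative.

SECCOMP_RET_ERRNO = 0x00050000
SECCOMP_RET_ALLOW = 0x7FFF0000
EPERM = 1
NR_GETPID = 39  # getpid on x86_64


class FilterStack:
    def __init__(self):
        self.filters = []         # callables: (nr, args) -> action
        self.always_allow = None  # int used as a bitmap; None until a filter is attached
        self.filter_runs = 0      # counts how many filter invocations happened

    def attach(self, func, unconditional_allow):
        """Attach a filter plus the syscall numbers it allows regardless of arguments
        (assumed to be known here; a real implementation must compute this)."""
        bitmap = 0
        for nr in unconditional_allow:
            bitmap |= 1 << nr
        # A syscall may skip the filters only if every attached filter allows it.
        if self.always_allow is None:
            self.always_allow = bitmap
        else:
            self.always_allow &= bitmap
        self.filters.append(func)

    def check(self, nr, args=()):
        # O(1) fast path: one bit test, no filter is run.
        if self.always_allow is not None and (self.always_allow >> nr) & 1:
            return SECCOMP_RET_ALLOW
        # Slow path: run every filter, the most restrictive action wins.
        # min() is only a valid stand-in for the two action values used here.
        result = SECCOMP_RET_ALLOW
        for func in reversed(self.filters):
            self.filter_runs += 1
            result = min(result, func(nr, args))
        return result


def allow_everything(nr, args):
    return SECCOMP_RET_ALLOW


def deny_syscall_2(nr, args):
    return SECCOMP_RET_ERRNO | EPERM if nr == 2 else SECCOMP_RET_ALLOW
```

In this model, attaching both example filters with {NR_GETPID} as their
unconditional set means check(NR_GETPID) returns SECCOMP_RET_ALLOW without
incrementing filter_runs, while check(2) still runs both filters.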

Not that we need to optimize for the pathological case, but this would
be especially useful for cases like systemd, which appears to be
constructing seccomp filters very inefficiently, maybe on a per-syscall[3]
basis? For example, systemd-resolved has 32 (!) seccomp filters
attached[2]:

# grep ^Seccomp_filters /proc/$(pidof systemd-resolved)/status
Seccomp_filters:        32

# grep SystemCall /lib/systemd/system/systemd-resolved.service
SystemCallArchitectures=native
SystemCallErrorNumber=EPERM
SystemCallFilter=@system-service
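
The same check can be scripted. The sketch below parses the text of a
/proc/<pid>/status file and returns the Seccomp_filters count, or None on
kernels that do not expose the field (it comes from the patch in [2]).
The function name and the sample text are made up for illustration.

```python
# Sketch: extract the Seccomp_filters count from /proc/<pid>/status text.

def seccomp_filter_count(status_text):
    """Return the Seccomp_filters value as an int, or None if the field is absent."""
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key == "Seccomp_filters":
            return int(value.strip())
    return None


def seccomp_filter_count_for_pid(pid):
    """Read the status file of a live process (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        return seccomp_filter_count(f.read())


if __name__ == "__main__":
    sample = "Name:\tsystemd-resolve\nSeccomp:\t2\nSeccomp_filters:\t32\n"
    print(seccomp_filter_count(sample))
```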

I'd like to better understand what they're doing, but haven't had time
to dig in. (The systemd devel mailing list requires subscription, so
I've directly CCed some systemd folks that have touched seccomp there
recently. Hi! The start of this thread is here[4].)

-Kees

[1] https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git/commit/?h=seccomp/benchmark-bpf&id=20cc7d8f4238ea3bc1798f204bb865f4994cca27
[2] https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git/commit/?h=for-next/seccomp&id=9d06f16f463cef5c445af9738efed2bfe4c64730
[3] https://www.freedesktop.org/software/systemd/man/systemd.exec.html#SystemCallFilter=
[4] https://lore.kernel.org/bpf/c22a6c3cefc2412cad00ae14c1371711@huawei.com/

-- 
Kees Cook

