Re: new seccomp mode aims to improve performance

From: "zhujianwei (C)" <zhujianwei7@huawei.com>
To: Kees Cook <keescook@chromium.org>,
	Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: "bpf@vger.kernel.org" <bpf@vger.kernel.org>,
	"linux-security-module@vger.kernel.org"
	<linux-security-module@vger.kernel.org>,
	Hehuazhen <hehuazhen@huawei.com>,
	"Lennart Poettering" <lennart@poettering.net>,
	"Christian Ehrhardt" <christian.ehrhardt@canonical.com>,
	"Zbigniew Jędrzejewski-Szmek" <zbyszek@in.waw.pl>
Subject: Re: new seccomp mode aims to improve performance
Date: Mon, 1 Jun 2020 02:08:05 +0000
Message-ID: <ff10225b79a14fec9bc383e710d74b2e@huawei.com>
In-Reply-To: <202005291043.A63D910A8@keescook>

Here are the test results on Linux 5.7.0-rc7 for aarch64.
Retpoline is disabled by default:
# cat /sys/devices/system/cpu/vulnerabilities/spectre_v2
Not affected

bpf_jit_enable 1
bpf_jit_harden 0

We ran the unixbench syscall benchmark on the original kernel and on a modified one (bpf_prog_run_pin_on_cpu() replaced with a function that immediately returns 'allow').
The unixbench syscall testcase runs 5 system calls (close/umask/dup/getpid/getuid; 15 more syscalls are needed just to run it) in a loop for 10 seconds, counts the iterations, and prints the count at the end. We also stacked more filters (each with the same rules) to evaluate the situation Kees mentioned (cases like systemd-resolved), and the results confirm it: the more filters, the more overhead. Here are our results (./syscall 10 m):

original:
	seccomp_off:			10684939
	seccomp_on_1_filters:	8513805		overhead: 19.8%
	seccomp_on_4_filters:	7105592		overhead: 33.0%
	seccomp_on_32_filters:	2308677		overhead: 78.3%

after replacing bpf_prog_run_pin_on_cpu:
	seccomp_off:			10685244
	seccomp_on_1_filters:	9146483		overhead: 14.1%
	seccomp_on_4_filters:	8969886		overhead: 16.0%
	seccomp_on_32_filters:	6454372		overhead: 39.6%

BPF overhead with N filters (original overhead minus patched overhead):
	1_filters:		5.7%
	4_filters:		17.0%
	32_filters:	38.7%
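
The BPF overhead figures above match the difference between the two overhead columns (original minus patched), in percentage points. A minimal C sketch of that subtraction, using only the percentages reported above; the struct and function names are hypothetical and chosen for illustration:

```c
#include <stdio.h>

/* Overhead percentages as reported in the two tables above. */
struct overhead_row {
	int filters;          /* number of stacked seccomp filters */
	double original_pct;  /* overhead on the unmodified kernel */
	double patched_pct;   /* overhead with the BPF run replaced by 'allow' */
};

const struct overhead_row rows[] = {
	{ 1, 19.8, 14.1 },
	{ 4, 33.0, 16.0 },
	{ 32, 78.3, 39.6 },
};

/* Part of the overhead attributed to running the BPF programs. */
double bpf_overhead_pct(const struct overhead_row *r)
{
	return r->original_pct - r->patched_pct;
}

/* Print one line per filter count in the style of the list above. */
void print_bpf_overhead(void)
{
	size_t i;

	for (i = 0; i < sizeof(rows) / sizeof(rows[0]); i++)
		printf("%d_filters: %.1f%%\n", rows[i].filters,
		       bpf_overhead_pct(&rows[i]));
}
```

Calling print_bpf_overhead() from a main() prints the three rows; the remaining overhead in the patched column is what seccomp costs without running any BPF program.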

// kernel code modification:
static noinline u32 bpf_prog_run_pin_on_cpu_allow(const struct bpf_prog *prog, const void *ctx)
{
	return SECCOMP_RET_ALLOW;
}

static u32 seccomp_run_filters(const struct seccomp_data *sd,
			       struct seccomp_filter **match)
{
	u32 ret = SECCOMP_RET_ALLOW;
	...
	for (; f; f = f->prev) {
-		u32 cur_ret = bpf_prog_run_pin_on_cpu(f->prog, sd);
+		u32 cur_ret = bpf_prog_run_pin_on_cpu_allow(f->prog, sd);

		if (ACTION_ONLY(cur_ret) < ACTION_ONLY(ret)) {
			ret = cur_ret;
			*match = f;
		}
	}
	return ret;
}
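
To show why the cost grows with the number of filters, here is a user-space sketch of the same loop: every attached filter is run for every syscall, and the most restrictive result wins. This is only an illustration, not kernel code. The SKETCH_ names, the struct, and the callback standing in for the BPF program are hypothetical; the numeric action values mirror those in <linux/seccomp.h>.

```c
#include <stddef.h>
#include <stdint.h>

/* Values mirror <linux/seccomp.h>; the SKETCH_ names are hypothetical. */
#define SKETCH_RET_ERRNO        0x00050000U
#define SKETCH_RET_ALLOW        0x7fff0000U
#define SKETCH_RET_ACTION_FULL  0xffff0000U

/* Compare actions as signed values so that lower means more restrictive. */
#define SKETCH_ACTION_ONLY(ret) ((int32_t)((ret) & SKETCH_RET_ACTION_FULL))

struct sketch_filter {
	struct sketch_filter *prev;         /* previously attached filter */
	uint32_t (*run)(const void *data);  /* stands in for the BPF program */
};

/* Walk the whole chain; cost is linear in the number of filters. */
uint32_t sketch_run_filters(const struct sketch_filter *f, const void *data,
			    const struct sketch_filter **match)
{
	uint32_t ret = SKETCH_RET_ALLOW;

	for (; f; f = f->prev) {
		uint32_t cur_ret = f->run(data);

		if (SKETCH_ACTION_ONLY(cur_ret) < SKETCH_ACTION_ONLY(ret)) {
			ret = cur_ret;
			*match = f;
		}
	}
	return ret;
}

/* Two toy "programs": always allow, and always fail with errno 1. */
uint32_t sketch_allow(const void *data)
{
	(void)data;
	return SKETCH_RET_ALLOW;
}

uint32_t sketch_errno(const void *data)
{
	(void)data;
	return SKETCH_RET_ERRNO | 1;
}
```

Even when every filter returns allow, the loop still calls each one, which is the per-filter cost measured above.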

// unixbench syscall testcase with seccomp enabled:

void set_allow_rules(scmp_filter_ctx ctx)
{
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(close), 0);
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(dup), 0);
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(getpid), 0);
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(getuid), 0);
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(umask), 0);

	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(seccomp), 0);
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(prctl), 0);

	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(faccessat), 0);
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(openat), 0);
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(fstat), 0);
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(setitimer), 0);
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(rt_sigaction), 0);
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(brk), 0);
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(munmap), 0);
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(execve), 0);
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(mmap), 0);
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(mprotect), 0);

	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);
}

int main(int argc, char *argv[])
{
	...
	duration = atoi(argv[1]);
	iter = 0;
	wake_me(duration, report);

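	/* Default action: kill on any syscall that is not explicitly allowed. */
	scmp_filter_ctx ctx;
	ctx = seccomp_init(SCMP_ACT_KILL);
	set_allow_rules(ctx);
	/* Load the same filter repeatedly to stack N identical filters (32 here). */
	int cnt;
	for (cnt = 0; cnt < 32; cnt++)
		seccomp_load(ctx);

	/* Benchmark loop: 5 syscalls per iteration. */
	while (1) {
		close(dup(0));
		getpid();
		getuid();
		umask(022);

		iter++;
	}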
	...
}

-----Original Message-----
From: bpf-owner@vger.kernel.org [mailto:bpf-owner@vger.kernel.org] On Behalf Of Kees Cook
Sent: May 30, 2020 3:27
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: zhujianwei (C) <zhujianwei7@huawei.com>; bpf@vger.kernel.org; linux-security-module@vger.kernel.org; Hehuazhen <hehuazhen@huawei.com>; Lennart Poettering <lennart@poettering.net>; Christian Ehrhardt <christian.ehrhardt@canonical.com>; Zbigniew Jędrzejewski-Szmek <zbyszek@in.waw.pl>
Subject: Re: new seccomp mode aims to improve performance

On Fri, May 29, 2020 at 09:09:28AM -0700, Kees Cook wrote:
> On Fri, May 29, 2020 at 08:43:56AM -0700, Alexei Starovoitov wrote:
> > I don't think your hunch at where cpu is spending cycles is correct.
> > Could you please do two experiments:
> > 1. try trivial seccomp bpf prog that simply returns 'allow'
> > 2. replace bpf_prog_run_pin_on_cpu() in seccomp.c with C code
> >   that returns 'allow' and make sure it's noinline or in a different C file,
> >   so that compiler doesn't optimize the whole seccomp_run_filters() into a nop.
> > 
> > Then measure performance of both.
> > I bet you'll see exactly the same numbers.
> 
> Android has already done this, it appeared to not be the same. Calling 
> into a SECCOMP_RET_ALLOW filter had a surprisingly high cost. I'll see 
> if I can get you the numbers. I was frankly quite surprised -- I 
> understood the bulk of the seccomp overhead to be in taking the 
> TIF_WORK path, copying arguments, etc, but something else is going on there.

So while it's not the Android measurements, here's what I'm seeing on
x86_64 (this is hardly a perfect noiseless benchmark, but sampling error appears to be close to 1%):

net.core.bpf_jit_enable=0:

Benchmarking 16777216 samples...
10.633756139 - 0.004359714 = 10629396425 getpid native: 633 ns
23.008737499 - 10.633967641 = 12374769858 getpid RET_ALLOW 1 filter: 737 ns
36.723141843 - 23.008975696 = 13714166147 getpid RET_ALLOW 2 filters: 817 ns
47.751422021 - 36.723345630 = 11028076391 getpid BPF-less allow: 657 ns
Estimated total seccomp overhead for 1 filter: 104 ns
Estimated total seccomp overhead for 2 filters: 184 ns
Estimated seccomp per-filter overhead: 80 ns
Estimated seccomp entry overhead: 24 ns
Estimated BPF overhead per filter: 80 ns

net.core.bpf_jit_enable=1:
net.core.bpf_jit_harden=1:

Benchmarking 16777216 samples...
31.939978606 - 21.275190689 = 10664787917 getpid native: 635 ns
43.324592380 - 31.940794751 = 11383797629 getpid RET_ALLOW 1 filter: 678 ns
55.001650599 - 43.326293248 = 11675357351 getpid RET_ALLOW 2 filters: 695 ns
65.986452855 - 55.002249904 = 10984202951 getpid BPF-less allow: 654 ns
Estimated total seccomp overhead for 1 filter: 43 ns
Estimated total seccomp overhead for 2 filters: 60 ns
Estimated seccomp per-filter overhead: 17 ns
Estimated seccomp entry overhead: 26 ns
Estimated BPF overhead per filter: 24 ns

net.core.bpf_jit_enable=1:
net.core.bpf_jit_harden=0:

Benchmarking 16777216 samples...
10.684681435 - 0.004198682 = 10680482753 getpid native: 636 ns
22.050823167 - 10.685571417 = 11365251750 getpid RET_ALLOW 1 filter: 677 ns
33.714134291 - 22.051100183 = 11663034108 getpid RET_ALLOW 2 filters: 695 ns
44.793312551 - 33.714383001 = 11078929550 getpid BPF-less allow: 660 ns
Estimated total seccomp overhead for 1 filter: 41 ns
Estimated total seccomp overhead for 2 filters: 59 ns
Estimated seccomp per-filter overhead: 18 ns
Estimated seccomp entry overhead: 23 ns
Estimated BPF overhead per filter: 17 ns

The above is from my (very dangerous!) benchmarking patch[1].
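
The "Estimated" figures in the three runs follow from the four per-call latencies by simple subtraction. The sketch below reproduces the printed values under that assumption; the benchmarking patch itself may compute them differently, and all struct and function names here are hypothetical.

```c
#include <stdint.h>

/* Per-call getpid latencies in nanoseconds, one per benchmark variant. */
struct getpid_ns {
	uint64_t native;       /* "getpid native" */
	uint64_t one_filter;   /* "RET_ALLOW 1 filter" */
	uint64_t two_filters;  /* "RET_ALLOW 2 filters" */
	uint64_t bpf_less;     /* "BPF-less allow" */
};

struct seccomp_estimates {
	uint64_t total_1;         /* total seccomp overhead, 1 filter */
	uint64_t total_2;         /* total seccomp overhead, 2 filters */
	uint64_t per_filter;      /* cost of each additional filter */
	uint64_t entry;           /* cost of entering seccomp at all */
	uint64_t bpf_per_filter;  /* cost attributed to running BPF */
};

/* Elapsed nanoseconds divided by sample count, truncated as in the output. */
uint64_t ns_per_call(uint64_t elapsed_ns, uint64_t samples)
{
	return elapsed_ns / samples;
}

/* Derive the estimates by subtracting the measured latencies. */
struct seccomp_estimates estimate(struct getpid_ns m)
{
	struct seccomp_estimates e;

	e.total_1 = m.one_filter - m.native;
	e.total_2 = m.two_filters - m.native;
	e.per_filter = m.two_filters - m.one_filter;
	e.entry = e.total_1 - e.per_filter;
	e.bpf_per_filter = m.one_filter - m.bpf_less;
	return e;
}
```

For the first run (bpf_jit_enable=0), feeding in 633, 737, 817 and 657 ns gives 104, 184, 80, 24 and 80 ns, matching the printed estimates.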

So, with the layered nature of seccomp filters there's a reasonable gain to be seen for an O(1) bitmap lookup to skip running even a single filter, even for the fastest BPF mode.
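
A rough sketch of what such an O(1) lookup could look like: one bit per syscall number, set only when the syscall is known to be allowed. This is an illustration of the idea, not an existing kernel interface; the names, the 512-entry bound, and the layout are all assumptions.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_NR_SYSCALLS 512  /* assumed upper bound, illustration only */
#define SKETCH_BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define SKETCH_WORDS \
	((SKETCH_NR_SYSCALLS + SKETCH_BITS_PER_WORD - 1) / SKETCH_BITS_PER_WORD)

/* One bit per syscall number: 1 means "known to be allowed". */
struct allow_bitmap {
	unsigned long bits[SKETCH_WORDS];
};

/* Mark a syscall as always allowed; out-of-range numbers are ignored. */
void allow_bitmap_set(struct allow_bitmap *bm, unsigned int nr)
{
	if (nr < SKETCH_NR_SYSCALLS)
		bm->bits[nr / SKETCH_BITS_PER_WORD] |=
			1UL << (nr % SKETCH_BITS_PER_WORD);
}

/* Constant-time check; out-of-range numbers are never "allowed". */
bool allow_bitmap_test(const struct allow_bitmap *bm, unsigned int nr)
{
	if (nr >= SKETCH_NR_SYSCALLS)
		return false;
	return (bm->bits[nr / SKETCH_BITS_PER_WORD] >>
		(nr % SKETCH_BITS_PER_WORD)) & 1UL;
}
```

A fast path would test the bit first and run the filter chain only when it is clear. The sketch assumes a bit is set only for syscalls that every attached filter allows regardless of arguments.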

Not that we need to optimize for the pathological case, but this would be especially useful for cases like systemd, which appears to be constructing seccomp filters very inefficiently, maybe on a per-syscall[3] basis? For example, systemd-resolved has 32 (!) seccomp filters attached[2]:

# grep ^Seccomp_filters /proc/$(pidof systemd-resolved)/status
Seccomp_filters:        32

# grep SystemCall /lib/systemd/system/systemd-resolved.service
SystemCallArchitectures=native
SystemCallErrorNumber=EPERM
SystemCallFilter=@system-service
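
The same count can be read programmatically. The sketch below extracts the Seccomp_filters field from text in the /proc/<pid>/status format shown by the grep above; the function name is hypothetical, and the field is only present on kernels that expose it.

```c
#include <stdio.h>
#include <string.h>

/*
 * Return the value of the "Seccomp_filters:" field in the given
 * /proc/<pid>/status text, or -1 if the field is absent or malformed.
 */
int parse_seccomp_filters(const char *status_text)
{
	static const char key[] = "Seccomp_filters:";
	const char *line = status_text;

	while (line && *line) {
		/* Only match the key at the start of a line. */
		if (strncmp(line, key, sizeof(key) - 1) == 0) {
			int count;

			if (sscanf(line + sizeof(key) - 1, "%d", &count) == 1)
				return count;
			return -1;
		}
		/* Advance to the character after the next newline, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}
```

A caller would read /proc/<pid>/status into a buffer and pass it in; for the systemd-resolved example above the result would be 32.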

I'd like to better understand what they're doing, but haven't had time to dig in. (The systemd devel mailing list requires subscription, so I've directly CCed some systemd folks that have touched seccomp there recently. Hi! The start of this thread is here[4].)

-Kees

[1] https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git/commit/?h=seccomp/benchmark-bpf&id=20cc7d8f4238ea3bc1798f204bb865f4994cca27
[2] https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git/commit/?h=for-next/seccomp&id=9d06f16f463cef5c445af9738efed2bfe4c64730
[3] https://www.freedesktop.org/software/systemd/man/systemd.exec.html#SystemCallFilter=
[4] https://lore.kernel.org/bpf/c22a6c3cefc2412cad00ae14c1371711@huawei.com/

--
Kees Cook