linux-security-module.vger.kernel.org archive mirror
* new seccomp mode aims to improve performance
@ 2020-05-29 12:48 zhujianwei (C)
  2020-05-29 15:43 ` Alexei Starovoitov
  0 siblings, 1 reply; 24+ messages in thread
From: zhujianwei (C) @ 2020-05-29 12:48 UTC (permalink / raw)
  To: bpf, linux-security-module; +Cc: Hehuazhen

Hi, all

  We're using seccomp to increase container security, but the BPF filter rules cause performance to deteriorate. Is there a good way to improve this, or could we add a simplified seccomp mode to improve performance?
  
  // Pseudo code
  int __secure_computing(int this_syscall)
  {
  	...
  	switch (mode) {
  	case SECCOMP_MODE_STRICT:
  		...
  	case SECCOMP_MODE_FILTER:
  		...
  	case SECCOMP_MODE_LIGHT_FILTER:
  		//do light syscall filter.
  		...
  		break;
  	}
  	...
  }
  		
  int light_syscall_filter(int syscall_num) {
  	if(syscall_num > SYSNUM_MAX) {
  		...
  		return -EACCES;
  	}
  
  	bool *filter_map = get_filter_map(current);
  	if(filter_map == NULL) {
  		...
  		return -EFAULT;
  	}
  
  	if(filter_map[syscall_num] == true) {
  		...
  		return 0;
  	} else {
  		...
  		return -EACCES;
  	}
  	...
  }

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: new seccomp mode aims to improve performance
  2020-05-29 12:48 new seccomp mode aims to improve performance zhujianwei (C)
@ 2020-05-29 15:43 ` Alexei Starovoitov
  2020-05-29 16:09   ` Kees Cook
  0 siblings, 1 reply; 24+ messages in thread
From: Alexei Starovoitov @ 2020-05-29 15:43 UTC (permalink / raw)
  To: zhujianwei (C); +Cc: bpf, linux-security-module, Hehuazhen, Kees Cook

On Fri, May 29, 2020 at 5:50 AM zhujianwei (C) <zhujianwei7@huawei.com> wrote:
>
> Hi, all
>
>   We're using seccomp to increase container security, but the BPF filter rules cause performance to deteriorate. Is there a good way to improve this, or could we add a simplified seccomp mode to improve performance?

I don't think your hunch at where cpu is spending cycles is correct.
Could you please do two experiments:
1. try trivial seccomp bpf prog that simply returns 'allow'
2. replace bpf_prog_run_pin_on_cpu() in seccomp.c with C code
  that returns 'allow' and make sure it's noinline or in a different C file,
  so that compiler doesn't optimize the whole seccomp_run_filters() into a nop.

Then measure performance of both.
I bet you'll see exactly the same numbers.
If you have retpoline on then bpf case will be slightly slower because
of retpoline cost.

Only after this experiment should we discuss options for accelerating seccomp.
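
For reference, experiment 1 boils down to something like this minimal standalone test (a sketch only; the syscall choice and iteration count are arbitrary):

#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	/* one-instruction filter: allow every syscall */
	struct sock_filter allow_all[] = {
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	struct sock_fprog prog = {
		.len = sizeof(allow_all) / sizeof(allow_all[0]),
		.filter = allow_all,
	};

	/* required to install a filter without CAP_SYS_ADMIN */
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
		return 1;
	if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog))
		return 1;

	/* time this loop with and without the two prctl() calls above;
	 * syscall() is used so libc caching cannot hide the syscall */
	for (long i = 0; i < 16777216; i++)
		syscall(SYS_getpid);
	return 0;
}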

>
>   // Pseudo code
>   int __secure_computing(int this_syscall)
>   {
>         ...
>         switch (mode) {
>         case SECCOMP_MODE_STRICT:
>                 ...
>         case SECCOMP_MODE_FILTER:
>                 ...
>         case SECCOMP_MODE_LIGHT_FILTER:
>                 //do light syscall filter.
>                 ...
>                 break;
>         }
>         ...
>   }
>
>   int light_syscall_filter(int syscall_num) {
>         if(syscall_num > SYSNUM_MAX) {
>                 ...
>                 return -EACCES;
>         }
>
>         bool *filter_map = get_filter_map(current);
>         if(filter_map == NULL) {
>                 ...
>                 return -EFAULT;
>         }
>
>         if(filter_map[syscall_num] == true) {
>                 ...
>                 return 0;
>         } else {
>                 ...
>                 return -EACCES;
>         }
>         ...
>   }

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: new seccomp mode aims to improve performance
  2020-05-29 15:43 ` Alexei Starovoitov
@ 2020-05-29 16:09   ` Kees Cook
  2020-05-29 17:31     ` Alexei Starovoitov
  2020-05-29 19:27     ` Kees Cook
  0 siblings, 2 replies; 24+ messages in thread
From: Kees Cook @ 2020-05-29 16:09 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: zhujianwei (C), bpf, linux-security-module, Hehuazhen

On Fri, May 29, 2020 at 08:43:56AM -0700, Alexei Starovoitov wrote:
> On Fri, May 29, 2020 at 5:50 AM zhujianwei (C) <zhujianwei7@huawei.com> wrote:
> >
> > Hi, all
> >
> >   We're using seccomp to increase container security, but the BPF filter rules cause performance to deteriorate. Is there a good way to improve this, or could we add a simplified seccomp mode to improve performance?

Yes, there are already plans for a simple syscall bitmap[1] seccomp feature.

> I don't think your hunch at where cpu is spending cycles is correct.
> Could you please do two experiments:
> 1. try trivial seccomp bpf prog that simply returns 'allow'
> 2. replace bpf_prog_run_pin_on_cpu() in seccomp.c with C code
>   that returns 'allow' and make sure it's noinline or in a different C file,
>   so that compiler doesn't optimize the whole seccomp_run_filters() into a nop.
> 
> Then measure performance of both.
> I bet you'll see exactly the same numbers.

Android has already done this, it appeared to not be the same. Calling
into a SECCOMP_RET_ALLOW filter had a surprisingly high cost. I'll see
if I can get you the numbers. I was frankly quite surprised -- I
understood the bulk of the seccomp overhead to be in taking the TIF_WORK
path, copying arguments, etc, but something else is going on there.

-Kees

[1] https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: new seccomp mode aims to improve performance
  2020-05-29 16:09   ` Kees Cook
@ 2020-05-29 17:31     ` Alexei Starovoitov
  2020-05-29 19:27     ` Kees Cook
  1 sibling, 0 replies; 24+ messages in thread
From: Alexei Starovoitov @ 2020-05-29 17:31 UTC (permalink / raw)
  To: Kees Cook; +Cc: zhujianwei (C), bpf, linux-security-module, Hehuazhen

On Fri, May 29, 2020 at 9:09 AM Kees Cook <keescook@chromium.org> wrote:
>
> On Fri, May 29, 2020 at 08:43:56AM -0700, Alexei Starovoitov wrote:
> > On Fri, May 29, 2020 at 5:50 AM zhujianwei (C) <zhujianwei7@huawei.com> wrote:
> > >
> > > Hi, all
> > >
> > >   We're using seccomp to increase container security, but the BPF filter rules cause performance to deteriorate. Is there a good way to improve this, or could we add a simplified seccomp mode to improve performance?
>
> Yes, there are already plans for a simple syscall bitmap[1] seccomp feature.
>
> > I don't think your hunch at where cpu is spending cycles is correct.
> > Could you please do two experiments:
> > 1. try trivial seccomp bpf prog that simply returns 'allow'
> > 2. replace bpf_prog_run_pin_on_cpu() in seccomp.c with C code
> >   that returns 'allow' and make sure it's noinline or in a different C file,
> >   so that compiler doesn't optimize the whole seccomp_run_filters() into a nop.
> >
> > Then measure performance of both.
> > I bet you'll see exactly the same numbers.
>
> Android has already done this, it appeared to not be the same. Calling
> into a SECCOMP_RET_ALLOW filter had a surprisingly high cost. I'll see
> if I can get you the numbers. I was frankly quite surprised -- I
> understood the bulk of the seccomp overhead to be in taking the TIF_WORK
> path, copying arguments, etc, but something else is going on there.

Kees,

Please show the numbers that prove your point.
I've seen people making this mistake over and over again.
Intel folks also said that calling into bpf is slow only to be proved wrong.
It turned out to be the cost of retpoline and bpf_dispatcher logic resolved it.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: new seccomp mode aims to improve performance
  2020-05-29 16:09   ` Kees Cook
  2020-05-29 17:31     ` Alexei Starovoitov
@ 2020-05-29 19:27     ` Kees Cook
  2020-05-31 17:19       ` Alexei Starovoitov
                         ` (2 more replies)
  1 sibling, 3 replies; 24+ messages in thread
From: Kees Cook @ 2020-05-29 19:27 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: zhujianwei (C),
	bpf, linux-security-module, Hehuazhen, Lennart Poettering,
	Christian Ehrhardt, Zbigniew Jędrzejewski-Szmek

On Fri, May 29, 2020 at 09:09:28AM -0700, Kees Cook wrote:
> On Fri, May 29, 2020 at 08:43:56AM -0700, Alexei Starovoitov wrote:
> > I don't think your hunch at where cpu is spending cycles is correct.
> > Could you please do two experiments:
> > 1. try trivial seccomp bpf prog that simply returns 'allow'
> > 2. replace bpf_prog_run_pin_on_cpu() in seccomp.c with C code
> >   that returns 'allow' and make sure it's noinline or in a different C file,
> >   so that compiler doesn't optimize the whole seccomp_run_filters() into a nop.
> > 
> > Then measure performance of both.
> > I bet you'll see exactly the same numbers.
> 
> Android has already done this, it appeared to not be the same. Calling
> into a SECCOMP_RET_ALLOW filter had a surprisingly high cost. I'll see
> if I can get you the numbers. I was frankly quite surprised -- I
> understood the bulk of the seccomp overhead to be in taking the TIF_WORK
> path, copying arguments, etc, but something else is going on there.

So while it's not the Android measurements, here's what I'm seeing on
x86_64 (this is hardly a perfect noiseless benchmark, but sampling error
appears to be close to 1%):


net.core.bpf_jit_enable=0:

Benchmarking 16777216 samples...
10.633756139 - 0.004359714 = 10629396425
getpid native: 633 ns
23.008737499 - 10.633967641 = 12374769858
getpid RET_ALLOW 1 filter: 737 ns
36.723141843 - 23.008975696 = 13714166147
getpid RET_ALLOW 2 filters: 817 ns
47.751422021 - 36.723345630 = 11028076391
getpid BPF-less allow: 657 ns
Estimated total seccomp overhead for 1 filter: 104 ns
Estimated total seccomp overhead for 2 filters: 184 ns
Estimated seccomp per-filter overhead: 80 ns
Estimated seccomp entry overhead: 24 ns
Estimated BPF overhead per filter: 80 ns


net.core.bpf_jit_enable=1:
net.core.bpf_jit_harden=1:

Benchmarking 16777216 samples...
31.939978606 - 21.275190689 = 10664787917
getpid native: 635 ns
43.324592380 - 31.940794751 = 11383797629
getpid RET_ALLOW 1 filter: 678 ns
55.001650599 - 43.326293248 = 11675357351
getpid RET_ALLOW 2 filters: 695 ns
65.986452855 - 55.002249904 = 10984202951
getpid BPF-less allow: 654 ns
Estimated total seccomp overhead for 1 filter: 43 ns
Estimated total seccomp overhead for 2 filters: 60 ns
Estimated seccomp per-filter overhead: 17 ns
Estimated seccomp entry overhead: 26 ns
Estimated BPF overhead per filter: 24 ns


net.core.bpf_jit_enable=1:
net.core.bpf_jit_harden=0:

Benchmarking 16777216 samples...
10.684681435 - 0.004198682 = 10680482753
getpid native: 636 ns
22.050823167 - 10.685571417 = 11365251750
getpid RET_ALLOW 1 filter: 677 ns
33.714134291 - 22.051100183 = 11663034108
getpid RET_ALLOW 2 filters: 695 ns
44.793312551 - 33.714383001 = 11078929550
getpid BPF-less allow: 660 ns
Estimated total seccomp overhead for 1 filter: 41 ns
Estimated total seccomp overhead for 2 filters: 59 ns
Estimated seccomp per-filter overhead: 18 ns
Estimated seccomp entry overhead: 23 ns
Estimated BPF overhead per filter: 17 ns


The above is from my (very dangerous!) benchmarking patch[1].

So, with the layered nature of seccomp filters there's a reasonable gain
to be seen for a O(1) bitmap lookup to skip running even a single filter,
even for the fastest BPF mode.
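
To make the idea concrete, here's a hedged sketch only (the allow_bitmap field and the helper name are hypothetical, and this is not the actual bitmap patch from [1]): __seccomp_filter() would call something like this before seccomp_run_filters() and return "allow" immediately on a hit, falling back to the normal filter walk on a miss.

/* hypothetical sketch: no such per-task field exists upstream today */
static bool seccomp_bitmap_allows(int this_syscall)
{
	const unsigned long *allow = READ_ONCE(current->seccomp.allow_bitmap);

	if (!allow || this_syscall < 0 || this_syscall >= NR_syscalls)
		return false;

	/* O(1): a single bit test instead of running N attached filters */
	return test_bit(this_syscall, allow);
}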

Not that we need to optimize for the pathological case, but this would
be especially useful for cases like systemd, which appears to be
constructing seccomp filters very inefficiently maybe on a per-syscall[3]
basis? For example, systemd-resolved has 32 (!) seccomp filters
attached[2]:

# grep ^Seccomp_filters /proc/$(pidof systemd-resolved)/status
Seccomp_filters:        32

# grep SystemCall /lib/systemd/system/systemd-resolved.service
SystemCallArchitectures=native
SystemCallErrorNumber=EPERM
SystemCallFilter=@system-service

I'd like to better understand what they're doing, but haven't had time
to dig in. (The systemd devel mailing list requires subscription, so
I've directly CCed some systemd folks that have touched seccomp there
recently. Hi! The start of this thread is here[4].)

-Kees

[1] https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git/commit/?h=seccomp/benchmark-bpf&id=20cc7d8f4238ea3bc1798f204bb865f4994cca27
[2] https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git/commit/?h=for-next/seccomp&id=9d06f16f463cef5c445af9738efed2bfe4c64730
[3] https://www.freedesktop.org/software/systemd/man/systemd.exec.html#SystemCallFilter=
[4] https://lore.kernel.org/bpf/c22a6c3cefc2412cad00ae14c1371711@huawei.com/

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: new seccomp mode aims to improve performance
  2020-05-29 19:27     ` Kees Cook
@ 2020-05-31 17:19       ` Alexei Starovoitov
  2020-06-01 18:16         ` Kees Cook
  2020-06-01  2:08       ` 答复: " zhujianwei (C)
  2020-06-01 10:11       ` Lennart Poettering
  2 siblings, 1 reply; 24+ messages in thread
From: Alexei Starovoitov @ 2020-05-31 17:19 UTC (permalink / raw)
  To: Kees Cook
  Cc: zhujianwei (C),
	bpf, linux-security-module, Hehuazhen, Lennart Poettering,
	Christian Ehrhardt, Zbigniew Jędrzejewski-Szmek, daniel,
	netdev

On Fri, May 29, 2020 at 12:27:03PM -0700, Kees Cook wrote:
> On Fri, May 29, 2020 at 09:09:28AM -0700, Kees Cook wrote:
> > On Fri, May 29, 2020 at 08:43:56AM -0700, Alexei Starovoitov wrote:
> > > I don't think your hunch at where cpu is spending cycles is correct.
> > > Could you please do two experiments:
> > > 1. try trivial seccomp bpf prog that simply returns 'allow'
> > > 2. replace bpf_prog_run_pin_on_cpu() in seccomp.c with C code
> > >   that returns 'allow' and make sure it's noinline or in a different C file,
> > >   so that compiler doesn't optimize the whole seccomp_run_filters() into a nop.
> > > 
> > > Then measure performance of both.
> > > I bet you'll see exactly the same numbers.
> > 
> > Android has already done this, it appeared to not be the same. Calling
> > into a SECCOMP_RET_ALLOW filter had a surprisingly high cost. I'll see
> > if I can get you the numbers. I was frankly quite surprised -- I
> > understood the bulk of the seccomp overhead to be in taking the TIF_WORK
> > path, copying arguments, etc, but something else is going on there.
> 
> So while it's not the Android measurements, here's what I'm seeing on
> x86_64 (this is hardly a perfect noiseless benchmark, but sampling error
> appears to close to 1%):
> 
> 
> net.core.bpf_jit_enable=0:
> 
> Benchmarking 16777216 samples...
> 10.633756139 - 0.004359714 = 10629396425
> getpid native: 633 ns
> 23.008737499 - 10.633967641 = 12374769858
> getpid RET_ALLOW 1 filter: 737 ns
> 36.723141843 - 23.008975696 = 13714166147
> getpid RET_ALLOW 2 filters: 817 ns
> 47.751422021 - 36.723345630 = 11028076391
> getpid BPF-less allow: 657 ns
> Estimated total seccomp overhead for 1 filter: 104 ns
> Estimated total seccomp overhead for 2 filters: 184 ns
> Estimated seccomp per-filter overhead: 80 ns
> Estimated seccomp entry overhead: 24 ns
> Estimated BPF overhead per filter: 80 ns
> 
> 
> net.core.bpf_jit_enable=1:
> net.core.bpf_jit_harden=1:
> 
> Benchmarking 16777216 samples...
> 31.939978606 - 21.275190689 = 10664787917
> getpid native: 635 ns
> 43.324592380 - 31.940794751 = 11383797629
> getpid RET_ALLOW 1 filter: 678 ns
> 55.001650599 - 43.326293248 = 11675357351
> getpid RET_ALLOW 2 filters: 695 ns
> 65.986452855 - 55.002249904 = 10984202951
> getpid BPF-less allow: 654 ns
> Estimated total seccomp overhead for 1 filter: 43 ns
> Estimated total seccomp overhead for 2 filters: 60 ns
> Estimated seccomp per-filter overhead: 17 ns
> Estimated seccomp entry overhead: 26 ns
> Estimated BPF overhead per filter: 24 ns
> 
> 
> net.core.bpf_jit_enable=1:
> net.core.bpf_jit_harden=0:
> 
> Benchmarking 16777216 samples...
> 10.684681435 - 0.004198682 = 10680482753
> getpid native: 636 ns
> 22.050823167 - 10.685571417 = 11365251750
> getpid RET_ALLOW 1 filter: 677 ns
> 33.714134291 - 22.051100183 = 11663034108
> getpid RET_ALLOW 2 filters: 695 ns
> 44.793312551 - 33.714383001 = 11078929550
> getpid BPF-less allow: 660 ns
> Estimated total seccomp overhead for 1 filter: 41 ns
> Estimated total seccomp overhead for 2 filters: 59 ns
> Estimated seccomp per-filter overhead: 18 ns
> Estimated seccomp entry overhead: 23 ns
> Estimated BPF overhead per filter: 17 ns
> 
> 
> The above is from my (very dangerous!) benchmarking patch[1].

Thank you for crafting a benchmark.
The only thing is that it's not doing a fair comparison.
The problem with that patch [1] is that it is using:

static noinline u32 __seccomp_benchmark(struct bpf_prog *prog,
                                        const struct seccomp_data *sd)
{
        return SECCOMP_RET_ALLOW;
}

as a benchmarking function.
The 'noinline' keyword tells the compiler to keep the body of the function, but
the compiler is still doing full control and data flow analysis through this
function and it is smart enough to optimize its usage in seccomp_run_filters()
and in __seccomp_filter() because all functions are in a single .c file.
Lots of code gets optimized away when 'f->benchmark' is on.

To make it into fair comparison I've added the following patch
on top of your [1].

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 2fdbf5ad8372..86204422e096 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -244,7 +244,7 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen)
        return 0;
 }

-static noinline u32 __seccomp_benchmark(struct bpf_prog *prog,
+__weak noinline u32 __seccomp_benchmark(struct bpf_prog *prog,
                                        const struct seccomp_data *sd)

Please take a look at 'make kernel/seccomp.s' before and after to see the difference
the __weak keyword makes.
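
As a standalone illustration of the same effect (my sketch, not kernel code; assumes gcc/clang at -O2 and mirrors the kernel's noinline macro): the switch in caller_static() folds to a constant despite 'noinline' because the callee's constant return value is visible in the same translation unit, while caller_weak() must keep the real call and the dispatch logic since a __weak symbol may be replaced at link time.

#define noinline __attribute__((noinline))

static noinline unsigned int run_filter_static(const void *ctx)
{
	return 0x7fff0000U;		/* SECCOMP_RET_ALLOW */
}

__attribute__((weak)) noinline unsigned int run_filter_weak(const void *ctx)
{
	return 0x7fff0000U;
}

int caller_static(const void *ctx)
{
	/* the callee's body and constant return value are visible, so the
	 * compiler can fold this switch to 'return 0' despite 'noinline' */
	switch (run_filter_static(ctx) & 0xffff0000U) {
	case 0x7fff0000U: return 0;
	default:          return -1;
	}
}

int caller_weak(const void *ctx)
{
	/* a __weak symbol may be overridden at link time, so a real call is
	 * emitted and the switch survives */
	switch (run_filter_weak(ctx) & 0xffff0000U) {
	case 0x7fff0000U: return 0;
	default:          return -1;
	}
}
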
And here is what seccomp_benchmark now reports:

Benchmarking 33554432 samples...
22.618269641 - 15.030812794 = 7587456847
getpid native: 226 ns
30.792042986 - 22.619048831 = 8172994155
getpid RET_ALLOW 1 filter: 243 ns
39.451435038 - 30.792836778 = 8658598260
getpid RET_ALLOW 2 filters: 258 ns
47.616011529 - 39.452190830 = 8163820699
getpid BPF-less allow: 243 ns
Estimated total seccomp overhead for 1 filter: 17 ns
Estimated total seccomp overhead for 2 filters: 32 ns
Estimated seccomp per-filter overhead: 15 ns
Estimated seccomp entry overhead: 2 ns
Estimated BPF overhead per filter: 0 ns

Depending on the run, the BPF-less mode can even be slower than with BPF ;)

Benchmarking 33554432 samples...
22.602737193 - 15.078827612 = 7523909581
getpid native: 224 ns
30.734009056 - 22.603540911 = 8130468145
getpid RET_ALLOW 1 filter: 242 ns
39.106701659 - 30.734762631 = 8371939028
getpid RET_ALLOW 2 filters: 249 ns
47.326509567 - 39.107552786 = 8218956781
getpid BPF-less allow: 244 ns
Estimated total seccomp overhead for 1 filter: 18 ns
Estimated total seccomp overhead for 2 filters: 25 ns
Estimated seccomp per-filter overhead: 7 ns
Estimated seccomp entry overhead: 11 ns
Estimated BPF overhead per filter: 18446744073709551614 ns

The above numbers were obtained on a non-debug kernel with retpoline off
and net.core.bpf_jit_enable=1.
When retpoline=y the f->benchmark mode will be slightly faster
due to the retpoline overhead.
If retpoline is a concern, the bpf dispatcher logic can be applied to seccomp.
It eliminated the retpoline overhead in the XDP/bpf fast path.

> So, with the layered nature of seccomp filters there's a reasonable gain
> to be seen for a O(1) bitmap lookup to skip running even a single filter,
> even for the fastest BPF mode.

This is not true.
The O(1) bitmap implemented as kernel C code will have exactly the same speed
as O(1) bitmap implemented as eBPF program.

> Not that we need to optimize for the pathological case, but this would
> be especially useful for cases like systemd, which appears to be
> constructing seccomp filters very inefficiently maybe on a per-syscall[3]
> basis? For example, systemd-resolved has 32 (!) seccomp filters
> attached[2]:
> 
> # grep ^Seccomp_filters /proc/$(pidof systemd-resolved)/status
> Seccomp_filters:        32
> 
> # grep SystemCall /lib/systemd/system/systemd-resolved.service
> SystemCallArchitectures=native
> SystemCallErrorNumber=EPERM
> SystemCallFilter=@system-service
> 
> I'd like to better understand what they're doing, but haven't had time
> to dig in. (The systemd devel mailing list requires subscription, so
> I've directly CCed some systemd folks that have touched seccomp there
> recently. Hi! The starts of this thread is here[4].)

32 seccomp filters sounds like a lot.
It would be great to find out what they are doing and whether
they can be optimized into a much shorter and faster eBPF program.

> -Kees
> 
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git/commit/?h=seccomp/benchmark-bpf&id=20cc7d8f4238ea3bc1798f204bb865f4994cca27
> [2] https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git/commit/?h=for-next/seccomp&id=9d06f16f463cef5c445af9738efed2bfe4c64730
> [3] https://www.freedesktop.org/software/systemd/man/systemd.exec.html#SystemCallFilter=
> [4] https://lore.kernel.org/bpf/c22a6c3cefc2412cad00ae14c1371711@huawei.com/
> 
> -- 
> Kees Cook

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* 答复: new seccomp mode aims to improve performance
  2020-05-29 19:27     ` Kees Cook
  2020-05-31 17:19       ` Alexei Starovoitov
@ 2020-06-01  2:08       ` zhujianwei (C)
  2020-06-01  3:30         ` Alexei Starovoitov
  2020-06-01 10:11       ` Lennart Poettering
  2 siblings, 1 reply; 24+ messages in thread
From: zhujianwei (C) @ 2020-06-01  2:08 UTC (permalink / raw)
  To: Kees Cook, Alexei Starovoitov
  Cc: bpf, linux-security-module, Hehuazhen, Lennart Poettering,
	Christian Ehrhardt, Zbigniew Jędrzejewski-Szmek

This is the test result on linux 5.7.0-rc7 for aarch64. 
And retpoline is disabled by default.
#cat /sys/devices/system/cpu/vulnerabilities/spectre_v2
Not affected

bpf_jit_enable 1
bpf_jit_harden 0

We ran the unixbench syscall benchmark on the original kernel and on a modified one (with bpf_prog_run_pin_on_cpu() replaced by a stub that immediately returns 'allow').
The unixbench syscall testcase runs 5 system calls (close/umask/dup/getpid/getuid; 15 extra syscalls are needed just to run it) in a loop for 10 seconds, counts the iterations and finally outputs the number. We also added more filters (each with the same rules) to evaluate the situation Kees mentioned (the systemd-resolved case), and we find it holds: more filters, more overhead. The following is our result (./syscall 10 m):

original:
	seccomp_off:			10684939
	seccomp_on_1_filters:	8513805		overhead:19.8%
	seccomp_on_4_filters:	7105592		overhead:33.0%
	seccomp_on_32_filters:	2308677		overhead:78.3%
	
after replacing bpf_prog_run_pin_on_cpu:
	seccomp_off:			10685244
	seccomp_on_1_filters:	9146483		overhead:14.1%
	seccomp_on_4_filters:	8969886		overhead:16.0%
	seccomp_on_32_filters:	6454372		overhead:39.6%

N-filter bpf overhead:
	1_filters:		5.7%
	4_filters:		17.0%
	32_filters:	38.7%

// kernel code modification place 
static noinline u32 bpf_prog_run_pin_on_cpu_allow(const struct bpf_prog *prog, const void *ctx)
{
	return SECCOMP_RET_ALLOW;
}

static u32 seccomp_run_filters(const struct seccomp_data *sd,
			       struct seccomp_filter **match)
{
	u32 ret = SECCOMP_RET_ALLOW;
	...
	for (; f; f = f->prev) {
-		u32 cur_ret = bpf_prog_run_pin_on_cpu(f->prog, sd);
+		u32 cur_ret = bpf_prog_run_pin_on_cpu_allow(f->prog, sd);

		if (ACTION_ONLY(cur_ret) < ACTION_ONLY(ret)) {
			ret = cur_ret;
			*match = f;
		}
	}
	return ret;
}

// unixbench syscall testcase with seccomp enabled:

void set_allow_rules(scmp_filter_ctx ctx)
{
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(close), 0);
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(dup), 0);
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(getpid), 0);
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(getuid), 0);
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(umask), 0);
	
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(seccomp), 0);
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(prctl), 0);
	
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(faccessat), 0);
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(openat), 0);
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(fstat), 0);
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(setitimer), 0);
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(rt_sigaction), 0);
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(brk), 0);
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(munmap), 0);
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(execve), 0);
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(mmap), 0);
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(mprotect), 0);
	
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);
}

int main(int argc, char *argv[])
{
	...
	duration = atoi(argv[1]);
	iter = 0;
	wake_me(duration, report);

	scmp_filter_ctx ctx;
	ctx = seccomp_init(SCMP_ACT_KILL);
	set_allow_rules(ctx);
	int cnt;
	for(cnt = 0; cnt < 32; cnt++)
		seccomp_load(ctx);

	while (1) {
		close(dup(0));
		getpid();
		getuid();
		umask(022);
		
		iter++;
   }
	...
}

-----Original Message-----
From: bpf-owner@vger.kernel.org [mailto:bpf-owner@vger.kernel.org] On Behalf Of Kees Cook
Sent: 30 May 2020 3:27
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: zhujianwei (C) <zhujianwei7@huawei.com>; bpf@vger.kernel.org; linux-security-module@vger.kernel.org; Hehuazhen <hehuazhen@huawei.com>; Lennart Poettering <lennart@poettering.net>; Christian Ehrhardt <christian.ehrhardt@canonical.com>; Zbigniew Jędrzejewski-Szmek <zbyszek@in.waw.pl>
Subject: Re: new seccomp mode aims to improve performance

On Fri, May 29, 2020 at 09:09:28AM -0700, Kees Cook wrote:
> On Fri, May 29, 2020 at 08:43:56AM -0700, Alexei Starovoitov wrote:
> > I don't think your hunch at where cpu is spending cycles is correct.
> > Could you please do two experiments:
> > 1. try trivial seccomp bpf prog that simply returns 'allow'
> > 2. replace bpf_prog_run_pin_on_cpu() in seccomp.c with C code
> >   that returns 'allow' and make sure it's noinline or in a different C file,
> >   so that compiler doesn't optimize the whole seccomp_run_filters() into a nop.
> > 
> > Then measure performance of both.
> > I bet you'll see exactly the same numbers.
> 
> Android has already done this, it appeared to not be the same. Calling 
> into a SECCOMP_RET_ALLOW filter had a surprisingly high cost. I'll see 
> if I can get you the numbers. I was frankly quite surprised -- I 
> understood the bulk of the seccomp overhead to be in taking the 
> TIF_WORK path, copying arguments, etc, but something else is going on there.

So while it's not the Android measurements, here's what I'm seeing on
x86_64 (this is hardly a perfect noiseless benchmark, but sampling error appears to close to 1%):


net.core.bpf_jit_enable=0:

Benchmarking 16777216 samples...
10.633756139 - 0.004359714 = 10629396425 getpid native: 633 ns
23.008737499 - 10.633967641 = 12374769858 getpid RET_ALLOW 1 filter: 737 ns
36.723141843 - 23.008975696 = 13714166147 getpid RET_ALLOW 2 filters: 817 ns
47.751422021 - 36.723345630 = 11028076391 getpid BPF-less allow: 657 ns Estimated total seccomp overhead for 1 filter: 104 ns Estimated total seccomp overhead for 2 filters: 184 ns Estimated seccomp per-filter overhead: 80 ns Estimated seccomp entry overhead: 24 ns Estimated BPF overhead per filter: 80 ns


net.core.bpf_jit_enable=1:
net.core.bpf_jit_harden=1:

Benchmarking 16777216 samples...
31.939978606 - 21.275190689 = 10664787917 getpid native: 635 ns
43.324592380 - 31.940794751 = 11383797629 getpid RET_ALLOW 1 filter: 678 ns
55.001650599 - 43.326293248 = 11675357351 getpid RET_ALLOW 2 filters: 695 ns
65.986452855 - 55.002249904 = 10984202951 getpid BPF-less allow: 654 ns Estimated total seccomp overhead for 1 filter: 43 ns Estimated total seccomp overhead for 2 filters: 60 ns Estimated seccomp per-filter overhead: 17 ns Estimated seccomp entry overhead: 26 ns Estimated BPF overhead per filter: 24 ns


net.core.bpf_jit_enable=1:
net.core.bpf_jit_harden=0:

Benchmarking 16777216 samples...
10.684681435 - 0.004198682 = 10680482753 getpid native: 636 ns
22.050823167 - 10.685571417 = 11365251750 getpid RET_ALLOW 1 filter: 677 ns
33.714134291 - 22.051100183 = 11663034108 getpid RET_ALLOW 2 filters: 695 ns
44.793312551 - 33.714383001 = 11078929550 getpid BPF-less allow: 660 ns Estimated total seccomp overhead for 1 filter: 41 ns Estimated total seccomp overhead for 2 filters: 59 ns Estimated seccomp per-filter overhead: 18 ns Estimated seccomp entry overhead: 23 ns Estimated BPF overhead per filter: 17 ns


The above is from my (very dangerous!) benchmarking patch[1].

So, with the layered nature of seccomp filters there's a reasonable gain to be seen for a O(1) bitmap lookup to skip running even a single filter, even for the fastest BPF mode.

Not that we need to optimize for the pathological case, but this would be especially useful for cases like systemd, which appears to be constructing seccomp filters very inefficiently maybe on a per-syscall[3] basis? For example, systemd-resolved has 32 (!) seccomp filters
attached[2]:

# grep ^Seccomp_filters /proc/$(pidof systemd-resolved)/status
Seccomp_filters:        32

# grep SystemCall /lib/systemd/system/systemd-resolved.service
SystemCallArchitectures=native
SystemCallErrorNumber=EPERM
SystemCallFilter=@system-service

I'd like to better understand what they're doing, but haven't had time to dig in. (The systemd devel mailing list requires subscription, so I've directly CCed some systemd folks that have touched seccomp there recently. Hi! The starts of this thread is here[4].)

-Kees

[1] https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git/commit/?h=seccomp/benchmark-bpf&id=20cc7d8f4238ea3bc1798f204bb865f4994cca27
[2] https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git/commit/?h=for-next/seccomp&id=9d06f16f463cef5c445af9738efed2bfe4c64730
[3] https://www.freedesktop.org/software/systemd/man/systemd.exec.html#SystemCallFilter=
[4] https://lore.kernel.org/bpf/c22a6c3cefc2412cad00ae14c1371711@huawei.com/

--
Kees Cook

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: new seccomp mode aims to improve performance
  2020-06-01  2:08       ` 答复: " zhujianwei (C)
@ 2020-06-01  3:30         ` Alexei Starovoitov
  2020-06-02  2:42           ` 答复: " zhujianwei (C)
  0 siblings, 1 reply; 24+ messages in thread
From: Alexei Starovoitov @ 2020-06-01  3:30 UTC (permalink / raw)
  To: zhujianwei (C)
  Cc: Kees Cook, bpf, linux-security-module, Hehuazhen,
	Lennart Poettering, Christian Ehrhardt,
	Zbigniew Jędrzejewski-Szmek

On Sun, May 31, 2020 at 7:08 PM zhujianwei (C) <zhujianwei7@huawei.com> wrote:
>
> This is the test result on linux 5.7.0-rc7 for aarch64.
> And retpoline disabled default.
> #cat /sys/devices/system/cpu/vulnerabilities/spectre_v2
> Not affected
>
> bpf_jit_enable 1
> bpf_jit_harden 0
>
> We run unixbench syscall benchmark on the original kernel and the new one(replace bpf_prog_run_pin_on_cpu() with immediately returning 'allow' one).
> The unixbench syscall testcase runs 5 system calls(close/umask/dup/getpid/getuid, extra 15 syscalls needed to run it) in a loop for 10 seconds, counts the number and finally output it. We also add some more filters (each with the same rules) to evaluate the situation just like kees mentioned(case like systemd-resolve), and we find it is right: more filters, more overhead. The following is our result (./syscall 10 m):
>
> original:
>         seccomp_off:                    10684939
>         seccomp_on_1_filters:   8513805         overhead:19.8%
>         seccomp_on_4_filters:   7105592         overhead:33.0%
>         seccomp_on_32_filters:  2308677         overhead:78.3%
>
> after replacing bpf_prog_run_pin_on_cpu:
>         seccomp_off:                    10685244
>         seccomp_on_1_filters:   9146483         overhead:14.1%
>         seccomp_on_4_filters:   8969886         overhead:16.0%
>         seccomp_on_32_filters:  6454372         overhead:39.6%
>
> N-filter bpf overhead:
>         1_filters:              5.7%
>         4_filters:              17.0%
>         32_filters:     38.7%
>
> // kernel code modification place
> static noinline u32 bpf_prog_run_pin_on_cpu_allow(const struct bpf_prog *prog, const void *ctx)
> {
>         return SECCOMP_RET_ALLOW;
> }

This is apples to oranges.
As explained earlier:
https://lore.kernel.org/netdev/20200531171915.wsxvdjeetmhpsdv2@ast-mbp.dhcp.thefacebook.com/T/#u
Please use __weak instead of static and redo the numbers.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: new seccomp mode aims to improve performance
  2020-05-29 19:27     ` Kees Cook
  2020-05-31 17:19       ` Alexei Starovoitov
  2020-06-01  2:08       ` 答复: " zhujianwei (C)
@ 2020-06-01 10:11       ` Lennart Poettering
  2020-06-01 12:32         ` Paul Moore
  2020-06-01 18:21         ` Kees Cook
  2 siblings, 2 replies; 24+ messages in thread
From: Lennart Poettering @ 2020-06-01 10:11 UTC (permalink / raw)
  To: Kees Cook
  Cc: Alexei Starovoitov, zhujianwei (C),
	bpf, linux-security-module, Hehuazhen, Christian Ehrhardt,
	Zbigniew Jędrzejewski-Szmek

On Fr, 29.05.20 12:27, Kees Cook (keescook@chromium.org) wrote:

> # grep ^Seccomp_filters /proc/$(pidof systemd-resolved)/status
> Seccomp_filters:        32
>
> # grep SystemCall /lib/systemd/system/systemd-resolved.service
> SystemCallArchitectures=native
> SystemCallErrorNumber=EPERM
> SystemCallFilter=@system-service
>
> I'd like to better understand what they're doing, but haven't had time
> to dig in. (The systemd devel mailing list requires subscription, so
> I've directly CCed some systemd folks that have touched seccomp there
> recently. Hi! The starts of this thread is here[4].)

Hmm, so on x86-64 we try to install our seccomp filters three times:
for the x86-64 syscall ABI, for the i386 syscall ABI and for the x32
syscall ABI. Not all of the filters we apply work on all ABIs though,
because syscalls are available on some but not others, or cannot
sensibly be matched on some (because of socketcall, ipc and such
multiplexed syscalls).

When we first added support for seccomp filters to systemd we compiled
everything into a single filter, and let libseccomp apply it to
different archs. But that didn't work out, since libseccomp doesn't
tell us when it manages to apply a filter and when not, i.e. for which
arch it worked and for which arch it didn't. And since we have some
whitelist and some blacklist filters the internal fallback logic of
libseccomp doesn't work for us either, since you never know what you
end up with. So we ended up breaking the different settings up into
individual filters, and apply them individually and separately for
each arch, so that we know exactly what we managed to install and what
not, and what we can then know will properly filter and can check in
our test suite.

Keeping the filters separate made things a lot easier and simpler to
debug, and our log output and testing became much less of a black
box. We know exactly what worked and what didn't, and our tests
validate each filter.

For systemd-resolved we apply a bunch more filters than just those
that are a result of SystemCallFilter= and SystemCallArchitectures=
(SystemCallFilter= itself synthesizes one filter per syscall ABI).

1. RestrictSUIDSGID= generates a seccomp filter to block creating suid/sgid
   binaries, i.e. filters chmod() and related calls and their
   arguments

2. LockPersonality= blocks personality() for most arguments

3. MemoryDenyWriteExecute= blocks mmap() and similar calls if the
   selected map has X and W set at the same time

4. RestrictRealtime= blocks sched_setscheduler() for most parameters

5. RestrictAddressFamilies= blocks socket() and related calls for
   various address families

6. ProtectKernelLogs= blocks the syslog() syscall for most parameters

7. ProtectKernelTunables= blocks the old _sysctl() syscall among some
   other things

8. RestrictNamespaces= blocks various unshare() and clone() bits

So yeah, if one turns on many of these options in services (and we
generally turn on everything we can for the services we ship) and then
multiply that by the archs you end up with quite a bunch.
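
To illustrate the shape of these (a rough libseccomp sketch, not our actual code), each option becomes its own small allow-by-default filter along these lines; here a simplified LockPersonality=-style filter that just turns personality() into EPERM (the real option still allows a few arguments):

#include <errno.h>
#include <seccomp.h>

static int apply_lock_personality_like_filter(void)
{
	scmp_filter_ctx ctx;
	int r;

	ctx = seccomp_init(SCMP_ACT_ALLOW);	/* blacklist-style filter */
	if (ctx == NULL)
		return -ENOMEM;

	/* simplified: deny personality() outright with EPERM */
	r = seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM),
			     SCMP_SYS(personality), 0);
	if (r == 0)
		r = seccomp_load(ctx);

	seccomp_release(ctx);
	return r;
}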

If we wanted to optimize that in userspace, then libseccomp would have
to be improved quite substantially to let us know exactly what works
and what doesn't, and to have sane fallback both when building
whitelists and blacklists.

An easy improvement is probably if libseccomp would now start refusing
to install x32 seccomp filters altogether now that x32 is entirely
dead? Or are the entrypoints for x32 syscalls still available in the
kernel? How could userspace figure out if they are available? If
libseccomp doesn't want to add code for that, we probably could have
that in systemd itself too...

Lennart

--
Lennart Poettering, Berlin

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: new seccomp mode aims to improve performance
  2020-06-01 10:11       ` Lennart Poettering
@ 2020-06-01 12:32         ` Paul Moore
  2020-06-02 12:53           ` Lennart Poettering
  2020-06-01 18:21         ` Kees Cook
  1 sibling, 1 reply; 24+ messages in thread
From: Paul Moore @ 2020-06-01 12:32 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: Kees Cook, Alexei Starovoitov, zhujianwei (C),
	bpf, linux-security-module, Hehuazhen, Christian Ehrhardt,
	Zbigniew Jędrzejewski-Szmek, Tom Hromatka

On Mon, Jun 1, 2020 at 6:17 AM Lennart Poettering
<lennart@poettering.net> wrote:
> On Fr, 29.05.20 12:27, Kees Cook (keescook@chromium.org) wrote:
> > # grep ^Seccomp_filters /proc/$(pidof systemd-resolved)/status
> > Seccomp_filters:        32
> >
> > # grep SystemCall /lib/systemd/system/systemd-resolved.service
> > SystemCallArchitectures=native
> > SystemCallErrorNumber=EPERM
> > SystemCallFilter=@system-service
> >
> > I'd like to better understand what they're doing, but haven't had time
> > to dig in. (The systemd devel mailing list requires subscription, so
> > I've directly CCed some systemd folks that have touched seccomp there
> > recently. Hi! The starts of this thread is here[4].)
>
> Hmm, so on x86-64 we try to install our seccomp filters three times:
> for the x86-64 syscall ABI, for the i386 syscall ABI and for the x32
> syscall ABI. Not all of the filters we apply work on all ABIs though,
> because syscalls are available on some but not others, or cannot
> sensibly be matched on some (because of socketcall, ipc and such
> multiplexed syscalls).
>
> When we fist added support for seccomp filters to systemd we compiled
> everything into a single filter, and let libseccomp apply it to
> different archs. But that didn't work out, since libseccomp doesn't
> tell use when it manages to apply a filter and when not, i.e. to which
> arch it worked and to which arch it didn't. And since we have some
> whitelist and some blacklist filters the internal fallback logic of
> libsecccomp doesn't work for us either, since you never know what you
> end up with. So we ended up breaking the different settings up into
> individual filters, and apply them individually and separately for
> each arch, so that we know exactly what we managed to install and what
> not, and what we can then know will properly filter and can check in
> our test suite.
>
> Keeping the filters separate made things a lot easier and simpler to
> debug, and our log output and testing became much less of a black
> box. We know exactly what worked and what didn't, and our test
> validate each filter.

In situations where the calling application creates multiple per-ABI
filters, the seccomp_merge(3) function can be used to merge the
filters into one.  There are some limitations (same byte ordering,
filter attributes, etc.) but in general it should work without problem
when merging x86_64, x32, and x86.
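
A minimal sketch of that flow (hypothetical code, assuming an x86_64 native host; not systemd code): build a native filter and an x86-only filter, add the per-ABI rules to each, merge, and load the single result.

#include <seccomp.h>

static int load_merged_x86_filters(void)
{
	scmp_filter_ctx ctx64, ctx32;
	int r = -1;

	ctx64 = seccomp_init(SCMP_ACT_ALLOW);	/* covers the native ABI */
	ctx32 = seccomp_init(SCMP_ACT_ALLOW);
	if (ctx64 == NULL || ctx32 == NULL)
		goto out;

	/* turn ctx32 into an x86-only filter so the two ABI sets are disjoint */
	if (seccomp_arch_remove(ctx32, SCMP_ARCH_NATIVE) != 0 ||
	    seccomp_arch_add(ctx32, SCMP_ARCH_X86) != 0)
		goto out;

	/* ... per-ABI rules would be added to ctx64 and ctx32 here ... */

	/* merge ctx32 into ctx64; libseccomp releases ctx32 on success */
	r = seccomp_merge(ctx64, ctx32);
	if (r == 0) {
		ctx32 = NULL;
		r = seccomp_load(ctx64);
	}
out:
	if (ctx64)
		seccomp_release(ctx64);
	if (ctx32)
		seccomp_release(ctx32);
	return r;
}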

For what it is worth, libseccomp does handle things like the
multiplexed socket syscalls[*] across multiple ABIs, just not quite in
the way Lennart and systemd wanted.  It is also possible, although I
would be a bit surprised, that some of the systemd's concerns have
been resolved in modern libseccomp.  For better or worse, systemd was
one of the first adopters of libseccomp and they had to deal with more
than a few bumps as the library was developed.

[*] Handling the multiplexed syscalls is tricky, especially when one
combines multiple ABIs and the presence of both the multiplexed and
direct-wired syscalls on some kernel versions.  Recent libseccomp
versions do handle all these cases; creating multiplexed filters,
direct-wired filters, or both depending on the particular ABI.  The
problem comes when you try to wrap all of that up in a single library
API that works regardless of the ABI and kernel version across
different build and runtime environments.  This is why we don't
support the "exact" variants of the libseccomp API on filters which
contain multiple ABIs, we simply can't guarantee that we will always
be able to filter on the third argument of socket() in a filter that
consists of the x86_64 and x86 ABIs.  The non-exact API variants
create the rules as best they can in this case, creating three rules
in the filter: a x86_64 rule which filters on the third argument of
socket(), a x86 rule which filters on the third argument of the
direct-wired socket(), and a x86 rule which filters on the multiplexed
socketcall(socket) syscall (impossible to filter on the syscall
argument here).

> For systemd-resolved we apply a bunch more filters than just those
> that are result of SystemCallFilter= and SystemCallArchitectures=
> (SystemCallFilter= itself synthesizes one filter per syscall ABI).

...

> So yeah, if one turns on many of these options in services (and we
> generally turn on everything we can for the services we ship) and then
> multiply that by the archs you end up with quite a bunch.

I'm not sure how systemd is architected with respect to seccomp
filtering, but once again it would seem like seccomp_merge() could be
useful here.

> If we wanted to optimize that in userspace, then libseccomp would have
> to be improved quite substantially to let us know exactly what works
> and what doesn't, and to have sane fallback both when building
> whitelists and blacklists.

It has been quite a while since we last talked about systemd's use of
libseccomp, but the upcoming v2.5.0 release (no date set yet, but
think weeks not months) finally takes a first step towards defining
proper return values on error for the API, no more "negative values on
error".  I'm sure there are other things, but I recall this as being
one of the bigger systemd wants.

As an aside, it is always going to be difficult to allow fine grained
control when you have a single libseccomp filter that includes
multiple ABIs; the different ABI oddities are just too great (see
comments above).  If you need exacting control of the filter, or ABI
specific handling, then the recommended way is to create those filters
independently and merge them together before loading them into the
kernel or applying any common rules.

> An easy improvement is probably if libseccomp would now start refusing
> to install x32 seccomp filters altogether now that x32 is entirely
> dead? Or are the entrypoints for x32 syscalls still available in the
> kernel? How could userspace figure out if they are available? If
> libseccomp doesn't want to add code for that, we probably could have
> that in systemd itself too...

You can eliminate x32 syscalls today using libseccomp through either
the "BADARCH" filter attribute or through an x32-specific filter that
defaults to KILL/ERRNO/etc. and has no rules (of course you could
merge this x32 filter with your x86_64 filter).
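
For example, the first option is roughly (a sketch; SCMP_ACT_KILL is already the default BADARCH action, it is set explicitly here just to show the attribute):

#include <seccomp.h>

static int reject_non_native(void)
{
	scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW);	/* native only */
	int r;

	if (ctx == NULL)
		return -1;

	/* any syscall from an ABI not covered by this filter gets killed */
	r = seccomp_attr_set(ctx, SCMP_FLTATR_ACT_BADARCH, SCMP_ACT_KILL);
	if (r == 0)
		r = seccomp_load(ctx);

	seccomp_release(ctx);
	return r;
}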

While I don't see us removing the ability to create x32 filters from
libseccomp any time soon (need to support older kernels), I can say
that I would be very happy to see x32 removed from systems.
Regardless of what one may think of the wisdom in creating this ABI, I
think we can agree the implementation was a bit of a hack.

-- 
paul moore
www.paul-moore.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: new seccomp mode aims to improve performance
  2020-05-31 17:19       ` Alexei Starovoitov
@ 2020-06-01 18:16         ` Kees Cook
  0 siblings, 0 replies; 24+ messages in thread
From: Kees Cook @ 2020-06-01 18:16 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: zhujianwei (C),
	bpf, linux-security-module, Hehuazhen, Lennart Poettering,
	Christian Ehrhardt, Zbigniew Jędrzejewski-Szmek, daniel,
	netdev

On Sun, May 31, 2020 at 10:19:15AM -0700, Alexei Starovoitov wrote:
> Thank you for crafting a benchmark.
> The only thing that it's not doing a fair comparison.
> The problem with that patch [1] that is using:
> 
> static noinline u32 __seccomp_benchmark(struct bpf_prog *prog,
>                                         const struct seccomp_data *sd)
> {
>         return SECCOMP_RET_ALLOW;
> }
> 
> as a benchmarking function.
> The 'noinline' keyword tells the compiler to keep the body of the function, but
> the compiler is still doing full control and data flow analysis though this
> function and it is smart enough to optimize its usage in seccomp_run_filters()
> and in __seccomp_filter() because all functions are in a single .c file.
> Lots of code gets optimized away when 'f->benchmark' is on.
> 
> To make it into fair comparison I've added the following patch
> on top of your [1].
> 
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 2fdbf5ad8372..86204422e096 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -244,7 +244,7 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen)
>         return 0;
>  }
> 
> -static noinline u32 __seccomp_benchmark(struct bpf_prog *prog,
> +__weak noinline u32 __seccomp_benchmark(struct bpf_prog *prog,
>                                         const struct seccomp_data *sd)
> 
> Please take a look at 'make kernel/seccomp.s' before and after to see the difference
> __weak keyword makes.

Ah yeah, thanks. That does bring it up to the same overhead. Nice!

> And here is what seccomp_benchmark now reports:
> 
> Benchmarking 33554432 samples...
> 22.618269641 - 15.030812794 = 7587456847
> getpid native: 226 ns
> 30.792042986 - 22.619048831 = 8172994155
> getpid RET_ALLOW 1 filter: 243 ns
> 39.451435038 - 30.792836778 = 8658598260
> getpid RET_ALLOW 2 filters: 258 ns
> 47.616011529 - 39.452190830 = 8163820699
> getpid BPF-less allow: 243 ns
> Estimated total seccomp overhead for 1 filter: 17 ns
> Estimated total seccomp overhead for 2 filters: 32 ns
> Estimated seccomp per-filter overhead: 15 ns
> Estimated seccomp entry overhead: 2 ns
> Estimated BPF overhead per filter: 0 ns
> 
> [...]
> 
> > So, with the layered nature of seccomp filters there's a reasonable gain
> > to be seen for a O(1) bitmap lookup to skip running even a single filter,
> > even for the fastest BPF mode.
> 
> This is not true.
> The O(1) bitmap implemented as kernel C code will have exactly the same speed
> as O(1) bitmap implemented as eBPF program.

Yes, that'd be true if it was the first (and only) filter. What I'm
trying to provide is a mechanism to speed up the syscalls for all
attached filters (i.e. create a seccomp fast-path). The reality of
seccomp usage is that it's very layered: systemd sets some (or many!),
then container runtime sets some, then the process itself might set
some.

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: new seccomp mode aims to improve performance
  2020-06-01 10:11       ` Lennart Poettering
  2020-06-01 12:32         ` Paul Moore
@ 2020-06-01 18:21         ` Kees Cook
  2020-06-02 12:44           ` Lennart Poettering
  1 sibling, 1 reply; 24+ messages in thread
From: Kees Cook @ 2020-06-01 18:21 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: Alexei Starovoitov, zhujianwei (C),
	bpf, linux-security-module, Hehuazhen, Christian Ehrhardt,
	Zbigniew Jędrzejewski-Szmek

On Mon, Jun 01, 2020 at 12:11:37PM +0200, Lennart Poettering wrote:
> On Fr, 29.05.20 12:27, Kees Cook (keescook@chromium.org) wrote:
> 
> > # grep ^Seccomp_filters /proc/$(pidof systemd-resolved)/status
> > Seccomp_filters:        32
> >
> > # grep SystemCall /lib/systemd/system/systemd-resolved.service
> > SystemCallArchitectures=native
> > SystemCallErrorNumber=EPERM
> > SystemCallFilter=@system-service
> >
> > I'd like to better understand what they're doing, but haven't had time
> > to dig in. (The systemd devel mailing list requires subscription, so
> > I've directly CCed some systemd folks that have touched seccomp there
> > recently. Hi! The starts of this thread is here[4].)
> 
> Hmm, so on x86-64 we try to install our seccomp filters three times:
> for the x86-64 syscall ABI, for the i386 syscall ABI and for the x32
> syscall ABI. Not all of the filters we apply work on all ABIs though,
> because syscalls are available on some but not others, or cannot
> sensibly be matched on some (because of socketcall, ipc and such
> multiplexed syscalls).
>
> [...]

Thanks for the details on this! That helps me understand what's
happening much better. :)

> An easy improvement is probably if libseccomp would now start refusing
> to install x32 seccomp filters altogether now that x32 is entirely
> dead? Or are the entrypoints for x32 syscalls still available in the
> kernel? How could userspace figure out if they are available? If
> libseccomp doesn't want to add code for that, we probably could have
> that in systemd itself too...

Would it make sense to provide a systemd setting for services to declare
"no compat" or "no x32" (I'm not sure what to call this mode more
generically, "no 32-bit allocation ABI"?) Then you can just install
a single merged filter for all the native syscalls that starts with
"if not native, reject"?

(Or better yet: make the default for filtering be "native only", and
let services opt into other ABIs?)

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 24+ messages in thread

* 答复: new seccomp mode aims to improve performance
  2020-06-01  3:30         ` Alexei Starovoitov
@ 2020-06-02  2:42           ` zhujianwei (C)
  2020-06-02  3:24             ` Alexei Starovoitov
  0 siblings, 1 reply; 24+ messages in thread
From: zhujianwei (C) @ 2020-06-02  2:42 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Kees Cook, bpf, linux-security-module, Hehuazhen,
	Lennart Poettering, Christian Ehrhardt,
	Zbigniew Jędrzejewski-Szmek

>
> This is the test result on linux 5.7.0-rc7 for aarch64.
> And retpoline disabled default.
> #cat /sys/devices/system/cpu/vulnerabilities/spectre_v2
> Not affected
>
> bpf_jit_enable 1
> bpf_jit_harden 0
>
> We run unixbench syscall benchmark on the original kernel and the new one(replace bpf_prog_run_pin_on_cpu() with immediately returning 'allow' one).
> The unixbench syscall testcase runs 5 system calls(close/umask/dup/getpid/getuid, extra 15 syscalls needed to run it) in a loop for 10 seconds, counts the number and finally output it. We also add some more filters (each with the same rules) to evaluate the situation just like kees mentioned(case like systemd-resolve), and we find it is right: more filters, more overhead. The following is our result (./syscall 10 m):
>
> original:
>         seccomp_off:                    10684939
>         seccomp_on_1_filters:   8513805         overhead:19.8%
>         seccomp_on_4_filters:   7105592         overhead:33.0%
>         seccomp_on_32_filters:  2308677         overhead:78.3%
>
> after replacing bpf_prog_run_pin_on_cpu:
>         seccomp_off:                    10685244
>         seccomp_on_1_filters:   9146483         overhead:14.1%
>         seccomp_on_4_filters:   8969886         overhead:16.0%
>         seccomp_on_32_filters:  6454372         overhead:39.6%
>
> N-filter bpf overhead:
>         1_filters:              5.7%
>         4_filters:              17.0%
>         32_filters:     38.7%
>
> // kernel code modification place
> static noinline u32 bpf_prog_run_pin_on_cpu_allow(const struct 
> bpf_prog *prog, const void *ctx) {
>         return SECCOMP_RET_ALLOW;
> }

>This is apples to oranges.
>As explained earlier:
>https://lore.kernel.org/netdev/20200531171915.wsxvdjeetmhpsdv2@ast-mbp.dhcp.thefacebook.com/T/#u
>Please use __weak instead of static and redo the numbers.


we have replaced ‘static’ with ‘__weak’, tested in the same way, and got almost the same result in our test environment (aarch64).

-static noinline u32 bpf_prog_run_pin_on_cpu_allow(const struct bpf_prog *prog, const void *ctx)
+__weak noinline u32 bpf_prog_run_pin_on_cpu_allow(const struct bpf_prog *prog, const void *ctx)

original:
	seccomp_off:			10684939
	seccomp_on_1_filters:	8513805		overhead:19.8%
	seccomp_on_4_filters:	7105592		overhead:33.0%
	seccomp_on_32_filters:	2308677		overhead:78.3%
	
after replacing bpf_prog_run_pin_on_cpu:
	seccomp_off:			10667195
	seccomp_on_1_filters:	9147454		overhead:14.2%
	seccomp_on_4_filters:	8927605		overhead:16.1%
	seccomp_on_32_filters:	6355476		overhead:40.6%

N-filter bpf overhead:
	1_filters:		5.6%
	4_filters:		16.9%
	32_filters:	37.7%




^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 答复: new seccomp mode aims to improve performance
  2020-06-02  2:42           ` 答复: " zhujianwei (C)
@ 2020-06-02  3:24             ` Alexei Starovoitov
  2020-06-02 11:13               ` 答复: " zhujianwei (C)
  2020-06-02 11:34               ` zhujianwei (C)
  0 siblings, 2 replies; 24+ messages in thread
From: Alexei Starovoitov @ 2020-06-02  3:24 UTC (permalink / raw)
  To: zhujianwei (C)
  Cc: Kees Cook, bpf, linux-security-module, Hehuazhen,
	Lennart Poettering, Christian Ehrhardt,
	Zbigniew Jędrzejewski-Szmek

On Tue, Jun 02, 2020 at 02:42:35AM +0000, zhujianwei (C) wrote:
> >
> > This is the test result on linux 5.7.0-rc7 for aarch64.
> > And retpoline disabled default.
> > #cat /sys/devices/system/cpu/vulnerabilities/spectre_v2
> > Not affected
> >
> > bpf_jit_enable 1
> > bpf_jit_harden 0
> >
> > We run unixbench syscall benchmark on the original kernel and the new one(replace bpf_prog_run_pin_on_cpu() with immediately returning 'allow' one).
> > The unixbench syscall testcase runs 5 system calls(close/umask/dup/getpid/getuid, extra 15 syscalls needed to run it) in a loop for 10 seconds, counts the number and finally output it. We also add some more filters (each with the same rules) to evaluate the situation just like kees mentioned(case like systemd-resolve), and we find it is right: more filters, more overhead. The following is our result (./syscall 10 m):
> >
> > original:
> >         seccomp_off:                    10684939
> >         seccomp_on_1_filters:   8513805         overhead:19.8%
> >         seccomp_on_4_filters:   7105592         overhead:33.0%
> >         seccomp_on_32_filters:  2308677         overhead:78.3%
> >
> > after replacing bpf_prog_run_pin_on_cpu:
> >         seccomp_off:                    10685244
> >         seccomp_on_1_filters:   9146483         overhead:14.1%
> >         seccomp_on_4_filters:   8969886         overhead:16.0%
> >         seccomp_on_32_filters:  6454372         overhead:39.6%
> >
> > N-filter bpf overhead:
> >         1_filters:              5.7%
> >         4_filters:              17.0%
> >         32_filters:     38.7%
> >
> > // kernel code modification place
> > static noinline u32 bpf_prog_run_pin_on_cpu_allow(const struct 
> > bpf_prog *prog, const void *ctx) {
> >         return SECCOMP_RET_ALLOW;
> > }
> 
> >This is apples to oranges.
> >As explained earlier:
> >https://lore.kernel.org/netdev/20200531171915.wsxvdjeetmhpsdv2@ast-mbp.dhcp.thefacebook.com/T/#u
> >Please use __weak instead of static and redo the numbers.
> 
> 
> we have replaced ‘static’ with ‘__weak’, tested with the same way, and got almostly the same result, in our test environment(aarch64).
> 
> -static noinline u32 bpf_prog_run_pin_on_cpu_allow(const struct bpf_prog *prog, const void *ctx)
> +__weak noinline u32 bpf_prog_run_pin_on_cpu_allow(const struct bpf_prog *prog, const void *ctx)
> 
> original:
> 	seccomp_off:			10684939
> 	seccomp_on_1_filters:	8513805		overhead:19.8%
> 	seccomp_on_4_filters:	7105592		overhead:33.0%
> 	seccomp_on_32_filters:	2308677		overhead:78.3%
> 	
> after replacing bpf_prog_run_pin_on_cpu:
> 	seccomp_off:			10667195
> 	seccomp_on_1_filters:	9147454		overhead:14.2%
> 	seccomp_on_4_filters:	8927605		overhead:16.1%
> 	seccomp_on_32_filters:	6355476		overhead:40.6%

Are you saying that by replacing 'static' with '__weak' it got slower?!
Something doesn't add up. Please check the generated assembly.
By having such a 'static noinline bpf_prog_run_pin_on_cpu' you're telling the
compiler to remove most of the seccomp_run_filters() code, which now will
return only two possible values. Which further means that the large 'switch'
statement in __seccomp_filter() is also optimized. populate_seccomp_data()
is removed. Etc, etc. That explains the 14% vs 19% difference.
Maybe you have some debug option on? Like cant_migrate() not being a nop?
Or static_branch not being supported?
The sure way to tell is to check the assembly.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* 答复: 答复: new seccomp mode aims to improve performance
  2020-06-02  3:24             ` Alexei Starovoitov
@ 2020-06-02 11:13               ` zhujianwei (C)
  2020-06-02 11:34               ` zhujianwei (C)
  1 sibling, 0 replies; 24+ messages in thread
From: zhujianwei (C) @ 2020-06-02 11:13 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Kees Cook, bpf, linux-security-module, Hehuazhen,
	Lennart Poettering, Christian Ehrhardt,
	Zbigniew Jędrzejewski-Szmek

> > This is the test result on linux 5.7.0-rc7 for aarch64.
> > And retpoline is disabled by default.
> > #cat /sys/devices/system/cpu/vulnerabilities/spectre_v2
> > Not affected
> >
> > bpf_jit_enable 1
> > bpf_jit_harden 0
> >
> > We run the unixbench syscall benchmark on the original kernel and on the new one (with bpf_prog_run_pin_on_cpu() replaced by one that immediately returns 'allow').
> > The unixbench syscall testcase runs 5 system calls (close/umask/dup/getpid/getuid; an extra 15 syscalls are needed to run the benchmark itself) in a loop for 10 seconds, counts the iterations and finally outputs the count. We also add some more filters (each with the same rules) to evaluate the situation Kees mentioned (cases like systemd-resolved), and we find it holds: more filters, more overhead. The following is our result (./syscall 10 m):
> >
> > original:
> >         seccomp_off:                    10684939
> >         seccomp_on_1_filters:   8513805         overhead:19.8%
> >         seccomp_on_4_filters:   7105592         overhead:33.0%
> >         seccomp_on_32_filters:  2308677         overhead:78.3%
> >
> > after replacing bpf_prog_run_pin_on_cpu:
> >         seccomp_off:                    10685244
> >         seccomp_on_1_filters:   9146483         overhead:14.1%
> >         seccomp_on_4_filters:   8969886         overhead:16.0%
> >         seccomp_on_32_filters:  6454372         overhead:39.6%
> >
> > N-filter bpf overhead:
> >         1_filters:              5.7%
> >         4_filters:              17.0%
> >         32_filters:     38.7%
> >
> > // kernel code modification place
> > static noinline u32 bpf_prog_run_pin_on_cpu_allow(const struct bpf_prog *prog, const void *ctx) {
> >         return SECCOMP_RET_ALLOW;
> > }
> 
> >This is apples to oranges.
> >As explained earlier:
> >https://lore.kernel.org/netdev/20200531171915.wsxvdjeetmhpsdv2@ast-mbp.dhcp.thefacebook.com/T/#u
> >Please use __weak instead of static and redo the numbers.
> 
> 
> we have replaced ‘static’ with ‘__weak’, tested in the same way, and got almost the same result in our test environment (aarch64).
> 
> -static noinline u32 bpf_prog_run_pin_on_cpu_allow(const struct bpf_prog *prog, const void *ctx)
> +__weak noinline u32 bpf_prog_run_pin_on_cpu_allow(const struct bpf_prog *prog, const void *ctx)
> 
> original:
> 	seccomp_off:			10684939
> 	seccomp_on_1_filters:	8513805		overhead:19.8%
> 	seccomp_on_4_filters:	7105592		overhead:33.0%
> 	seccomp_on_32_filters:	2308677		overhead:78.3%
> 	
> after replacing bpf_prog_run_pin_on_cpu:
> 	seccomp_off:			10667195
> 	seccomp_on_1_filters:	9147454		overhead:14.2%
> 	seccomp_on_4_filters:	8927605		overhead:16.1%
> 	seccomp_on_32_filters:	6355476		overhead:40.6%

> are you saying that by replacing 'static' with '__weak' it got slower?!
> Something doesn't add up. Please check generated assembly.
> By having such a 'static noinline bpf_prog_run_pin_on_cpu' you're telling the compiler to remove most of the seccomp_run_filters() code, which now will return only two possible values. Which further means that the large 'switch'
> statement in __seccomp_filter() is also optimized. populate_seccomp_data() is removed. Etc, etc. That explains the 14% vs 19% difference.
> Maybe you have some debug option on? Like cant_migrate() not being a nop?
> Or static_branch is not supported?
> The sure way is to check assembly.

No, we are saying that by replacing 'static' with '__weak' we got the same result, in our testcase, which filters with 20 allowed syscall numbers (for details, see the previous post).

static case:
	N-filter bpf overhead:
	1_filters:		5.7%
	4_filters:		17.0%
	32_filters:	38.7%

__weak case:
	N-filter bpf overhead:
	1_filters:		5.6%
	4_filters:		16.9%
	32_filters:	37.7%

^ permalink raw reply	[flat|nested] 24+ messages in thread

* 答复: 答复: new seccomp mode aims to improve performance
  2020-06-02  3:24             ` Alexei Starovoitov
  2020-06-02 11:13               ` 答复: " zhujianwei (C)
@ 2020-06-02 11:34               ` zhujianwei (C)
  2020-06-02 18:32                 ` Kees Cook
  1 sibling, 1 reply; 24+ messages in thread
From: zhujianwei (C) @ 2020-06-02 11:34 UTC (permalink / raw)
  To: Alexei Starovoitov, Kees Cook
  Cc: bpf, linux-security-module, Hehuazhen, Lennart Poettering,
	Christian Ehrhardt, Zbigniew Jędrzejewski-Szmek

> > > This is the test result on linux 5.7.0-rc7 for aarch64.
> > > And retpoline is disabled by default.
> > > #cat /sys/devices/system/cpu/vulnerabilities/spectre_v2
> > > Not affected
> > >
> > > bpf_jit_enable 1
> > > bpf_jit_harden 0
> > >
> > > We run the unixbench syscall benchmark on the original kernel and on the new one (with bpf_prog_run_pin_on_cpu() replaced by one that immediately returns 'allow').
> > > The unixbench syscall testcase runs 5 system calls (close/umask/dup/getpid/getuid; an extra 15 syscalls are needed to run the benchmark itself) in a loop for 10 seconds, counts the iterations and finally outputs the count. We also add some more filters (each with the same rules) to evaluate the situation Kees mentioned (cases like systemd-resolved), and we find it holds: more filters, more overhead. The following is our result (./syscall 10 m):
> > >
> > > original:
> > >         seccomp_off:                    10684939
> > >         seccomp_on_1_filters:   8513805         overhead:19.8%
> > >         seccomp_on_4_filters:   7105592         overhead:33.0%
> > >         seccomp_on_32_filters:  2308677         overhead:78.3%
> > >
> > > after replacing bpf_prog_run_pin_on_cpu:
> > >         seccomp_off:                    10685244
> > >         seccomp_on_1_filters:   9146483         overhead:14.1%
> > >         seccomp_on_4_filters:   8969886         overhead:16.0%
> > >         seccomp_on_32_filters:  6454372         overhead:39.6%
> > >
> > > N-filter bpf overhead:
> > >         1_filters:              5.7%
> > >         4_filters:              17.0%
> > >         32_filters:     38.7%
> > >
> > > // kernel code modification place
> > > static noinline u32 bpf_prog_run_pin_on_cpu_allow(const struct bpf_prog *prog, const void *ctx) {
> > >         return SECCOMP_RET_ALLOW;
> > > }
> > 
> > >This is apples to oranges.
> > >As explained earlier:
> > >https://lore.kernel.org/netdev/20200531171915.wsxvdjeetmhpsdv2@ast-mbp.dhcp.thefacebook.com/T/#u
> > >Please use __weak instead of static and redo the numbers.
> > 
> > 
> > we have replaced ‘static’ with ‘__weak’, tested in the same way, and got almost the same result in our test environment (aarch64).
> > 
> > -static noinline u32 bpf_prog_run_pin_on_cpu_allow(const struct bpf_prog *prog, const void *ctx)
> > +__weak noinline u32 bpf_prog_run_pin_on_cpu_allow(const struct bpf_prog *prog, const void *ctx)
> > 
> > original:
> > 	seccomp_off:			10684939
> > 	seccomp_on_1_filters:	8513805		overhead:19.8%
> > 	seccomp_on_4_filters:	7105592		overhead:33.0%
> > 	seccomp_on_32_filters:	2308677		overhead:78.3%
> > 	
> > after replacing bpf_prog_run_pin_on_cpu:
> > 	seccomp_off:			10667195
> > 	seccomp_on_1_filters:	9147454		overhead:14.2%
> > 	seccomp_on_4_filters:	8927605		overhead:16.1%
> > 	seccomp_on_32_filters:	6355476		overhead:40.6%

> > are you saying that by replacing 'static' with '__weak' it got slower?!
> > Something doesn't add up. Please check generated assembly.
> > By having such a 'static noinline bpf_prog_run_pin_on_cpu' you're telling the compiler to remove most of the seccomp_run_filters() code, which now will return only two possible values. Which further means that the large 'switch'
> > statement in __seccomp_filter() is also optimized. populate_seccomp_data() is removed. Etc, etc. That explains the 14% vs 19% difference.
> > Maybe you have some debug option on? Like cant_migrate() not being a nop?
> > Or static_branch is not supported?
> > The sure way is to check assembly.

> No, we are saying that by replacing 'static' with '__weak' we got the same result, in our testcase, which filters with 20 allowed syscall numbers (for details, see the previous post).
>
> static case:
>	N-filter bpf overhead:
>	1_filters:		5.7%
>	4_filters:		17.0%
>	32_filters:	38.7%
>
> __weak case:
>	N-filter bpf overhead:
>	1_filters:		5.6%
>	4_filters:		16.9%
>	32_filters:	37.7%

And in many scenarios the requirement for syscall filtering is usually simple and does not need complex filter rules; for example, just configuring a syscall blacklist or whitelist. However, we have noticed that seccomp has a performance overhead that cannot be ignored even in this simple scenario. For example, referring to Kees's test data, this cost is about 41/636 = 6.5%, and Alexei's data gives 17/226 = 7.5%, both based on a single filter rule (getpid); our data for this overhead is 19.8% (refer to the previous 'original' test results), filtering with our 20 rules (unixbench syscall).

To improve filtering performance in these scenarios, as a PoC we replaced the call to secure_computing() in syscall_trace_enter() with a light_syscall_filter() that uses a bitmap for syscall filtering and only handles syscall-number-only rules, not rules that inspect syscall arguments. light_syscall_filter() is intended to be a sub-branch of secure_computing().

To measure the performance, we use the same unixbench syscall testcase with 20 allowed syscall-number rules. The result shows that light_syscall_filter() imposes only 1.2% overhead.
Could we have a deeper discussion about adding this light filter mode?

Here are the details:

light_syscall_filter:	//run unixbench syscall testcase 10 times and get the average.
	seccomp_off:			10684018
	seccomp_on_1_filters:	10553233
	overhead:				1.2%

// kernel modification
--- linux-5.7-rc7_1/arch/arm64/kernel/ptrace.c	2020-05-25 06:32:54.000000000 +0800
+++ linux-5.7-rc7/arch/arm64/kernel/ptrace.c	2020-06-02 12:35:04.412000000 +0800
@@ -1827,6 +1827,46 @@
 	regs->regs[regno] = saved_reg;
 }
 
+#define PID_MAX    1000000
+#define SYSNUM_MAX 0x220
+
+/* all zero*/
+bool g_light_filter_switch[PID_MAX] = {0};
+bool g_light_filter_bitmap[PID_MAX][SYSNUM_MAX] = {0};
+
+
+static int __light_syscall_filter(void) {
+   int pid;
+	int this_syscall;
+
+   pid = current->pid;
+	this_syscall = syscall_get_nr(current, task_pt_regs(current));
+
+   if(g_light_filter_bitmap[pid][this_syscall] == true) {
+       printk(KERN_ERR "light syscall filter: syscall num %d denied.\n", this_syscall);
+		goto skip;
+   }
+
+	return 0;
+skip:	
+	return -1;
+}
+
+static inline int light_syscall_filter(void) {
+	if (unlikely(test_thread_flag(TIF_SECCOMP))) {
+                 return __light_syscall_filter();
+        }
+
+	return 0;
+}
+
 int syscall_trace_enter(struct pt_regs *regs)
 {
 	unsigned long flags = READ_ONCE(current_thread_info()->flags);
@@ -1837,9 +1877,10 @@
 			return -1;
 	}
 
-	/* Do the secure computing after ptrace; failures should be fast. */
-	if (secure_computing() == -1)
+	/* light check for syscall-num-only rule. */
+	if (light_syscall_filter() == -1) {
 		return -1;
+	}
 
 	if (test_thread_flag(TIF_SYSCALL_TRACEPOINT))
 		trace_sys_enter(regs, regs->syscallno);

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: new seccomp mode aims to improve performance
  2020-06-01 18:21         ` Kees Cook
@ 2020-06-02 12:44           ` Lennart Poettering
  2020-06-02 18:37             ` Kees Cook
  2020-06-16  6:00             ` Kees Cook
  0 siblings, 2 replies; 24+ messages in thread
From: Lennart Poettering @ 2020-06-02 12:44 UTC (permalink / raw)
  To: Kees Cook
  Cc: Alexei Starovoitov, zhujianwei (C),
	bpf, linux-security-module, Hehuazhen, Christian Ehrhardt,
	Zbigniew Jędrzejewski-Szmek

On Mo, 01.06.20 11:21, Kees Cook (keescook@chromium.org) wrote:

> > > # grep SystemCall /lib/systemd/system/systemd-resolved.service
> > > SystemCallArchitectures=native
> > > SystemCallErrorNumber=EPERM
> > > SystemCallFilter=@system-service
> > >
> > > I'd like to better understand what they're doing, but haven't had time
> > > to dig in. (The systemd devel mailing list requires subscription, so
> > > I've directly CCed some systemd folks that have touched seccomp there
> > > recently. Hi! The starts of this thread is here[4].)
> >
> > Hmm, so on x86-64 we try to install our seccomp filters three times:
> > for the x86-64 syscall ABI, for the i386 syscall ABI and for the x32
> > syscall ABI. Not all of the filters we apply work on all ABIs though,
> > because syscalls are available on some but not others, or cannot
> > sensibly be matched on some (because of socketcall, ipc and such
> > multiplexed syscalls).
> >
> > [...]
>
> Thanks for the details on this! That helps me understand what's
> happening much better. :)
>
> > An easy improvement is probably if libseccomp would now start refusing
> > to install x32 seccomp filters altogether now that x32 is entirely
> > dead? Or are the entrypoints for x32 syscalls still available in the
> > kernel? How could userspace figure out if they are available? If
> > libseccomp doesn't want to add code for that, we probably could have
> > that in systemd itself too...
>
> Would it make sense to provide a systemd setting for services to declare
> "no compat" or "no x32" (I'm not sure what to call this mode more
> generically, "no 32-bit allocation ABI"?) Then you can just install
> a single merged filter for all the native syscalls that starts with
> "if not native, reject"?

We have that actually, it's this line you pasted above:

        SystemCallArchitectures=native

It means: block all syscall ABIs but the native one for all processes
of this service.

We currently use that setting only to synthesize an explicit seccomp
filter masking the other ABIs wholesale. We do not use it to suppress
generation of other, unrelated seccomp filters for that
arch. i.e. which means you might end up with one filter blocking x32
wholesale, but then another unrelated option might install a filter
blocking some specific syscall with some specific arguments, but still
gets installed for x86-64 *and* i386 *and* x32. I guess we could
relatively easily tweak that and suppress the latter. If we did, then
on all services that set SystemCallArchitectures=native on x86-64 the
number of installed seccomp filters should drop to a third.

> (Or better yet: make the default for filtering be "native only", and
> let services opt into other ABIs?)

That sounds like it would make people quite unhappy, no? Given that on
a systemd system anything that runs in userspace is ultimately part of
a service managed by systemd, defaulting to "native ABI only" would
translate to "yeah, we entirely disable the i386 ABI for the entire
system unless you reconfigure it and/or opt out your old i386
services".

Hence, on x86-64, I figure just masking i386 entirely is a bit too
drastic a compat breakage for us, no? Masking x32 otoh sounds like a
safe default to do without breaking too much compat given that x32 is
on its way out.

Lennart

--
Lennart Poettering, Berlin

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: new seccomp mode aims to improve performance
  2020-06-01 12:32         ` Paul Moore
@ 2020-06-02 12:53           ` Lennart Poettering
  2020-06-02 15:03             ` Paul Moore
  0 siblings, 1 reply; 24+ messages in thread
From: Lennart Poettering @ 2020-06-02 12:53 UTC (permalink / raw)
  To: Paul Moore
  Cc: Kees Cook, Alexei Starovoitov, zhujianwei (C),
	bpf, linux-security-module, Hehuazhen, Christian Ehrhardt,
	Zbigniew Jędrzejewski-Szmek, Tom Hromatka

On Mo, 01.06.20 08:32, Paul Moore (paul@paul-moore.com) wrote:

> In situations where the calling application creates multiple per-ABI
> filters, the seccomp_merge(3) function can be used to merge the
> filters into one.  There are some limitations (same byte ordering,
> filter attributes, etc.) but in general it should work without problem
> when merging x86_64, x32, and x86.

Hmm, so we currently only use seccomp_rule_add_exact() to build an
individual filter and finally seccomp_load() to install it. Which
tells us exactly what works and what does not.

If we now would use seccomp_rule_add_exact() to build the filters, but
then use seccomp_merge() to merge them all, and then only do a single
seccomp_load(), will this give us the same info? i.e. will
seccomp_merge() give us the same errors seccomp_load() currently gives
us when something cannot work?

> > If we wanted to optimize that in userspace, then libseccomp would have
> > to be improved quite substantially to let us know exactly what works
> > and what doesn't, and to have sane fallback both when building
> > whitelists and blacklists.
>
> It has been quite a while since we last talked about systemd's use of
> libseccomp, but the upcoming v2.5.0 release (no date set yet, but
> think weeks not months) finally takes a first step towards defining
> proper return values on error for the API, no more "negative values on
> error".  I'm sure there are other things, but I recall this as being
> one of the bigger systemd wants.

Yes, we care about error codes a lot.

> As an aside, it is always going to be difficult to allow fine grained
> control when you have a single libseccomp filter that includes
> multiple ABIs; the different ABI oddities are just too great (see
> comments above).  If you need exacting control of the filter, or ABI
> specific handling, then the recommended way is to create those filters
> independently and merge them together before loading them into the
> kernel or applying any common rules.

Hmm, so not sure I got this. But are you saying that when using
seccomp_merge() am I supposed to merge filters for different archs
into one megafilter, or are you saying the opposite: am I supposed not
to do that?

I.e. in an ideal world, should we come to a situation where per
service on x86-64 we will have exactly one filter installed, or should
we come to a situation where we'll have exactly three installed, one
for each ABI?

Lennart

--
Lennart Poettering, Berlin

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: new seccomp mode aims to improve performance
  2020-06-02 12:53           ` Lennart Poettering
@ 2020-06-02 15:03             ` Paul Moore
  2020-06-02 18:39               ` Kees Cook
  0 siblings, 1 reply; 24+ messages in thread
From: Paul Moore @ 2020-06-02 15:03 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: Kees Cook, Alexei Starovoitov, zhujianwei (C),
	bpf, linux-security-module, Hehuazhen, Christian Ehrhardt,
	Zbigniew Jędrzejewski-Szmek, Tom Hromatka

On Tue, Jun 2, 2020 at 8:53 AM Lennart Poettering
<lennart@poettering.net> wrote:
> On Mo, 01.06.20 08:32, Paul Moore (paul@paul-moore.com) wrote:
>
> > In situations where the calling application creates multiple per-ABI
> > filters, the seccomp_merge(3) function can be used to merge the
> > filters into one.  There are some limitations (same byte ordering,
> > filter attributes, etc.) but in general it should work without problem
> > when merging x86_64, x32, and x86.
>
> Hmm, so we currently only use seccomp_rule_add_exact() to build an
> individual filter and finally seccomp_load() to install it. Which
> tells us exactly what works and what does not.
>
> If we now would use seccomp_rule_add_exact() to build the filters, but
> then use seccomp_merge() to merge them all, and then only do a single
> seccomp_load(), will this give us the same info? i.e. will
> seccomp_merge() give us the same errors seccomp_load() currently gives
> us when something cannot work?

Yes, it should.  The seccomp_merge() API was created for applications,
like systemd, that create multiple per-ABI filters but still want to
consolidate the multiple filters into one, either to add common rules
or reduce the number of filters loaded into the kernel.

Since the motivation behind the seccomp_merge() API was to help make
it easier to create complex filters for multiple ABIs, it is limited
to merging filters that do not share any of the same ABIs.  For
example, you could merge a filter that contains a x86 ABI with another
filter that contains x86_64 and x32 ABIs but you can *not* merge two
filters that both contain x86 ABIs.  This greatly simplifies the merge
operation and makes it fairly quick.
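
As a concrete illustration of that pattern, here is a minimal sketch: build
per-ABI filters with seccomp_rule_add_exact(), consolidate them with
seccomp_merge(), and finish with a single seccomp_load(). This is not
systemd's actual code; the single getpid rule is a placeholder, error
handling is trimmed, and it assumes a reasonably recent libseccomp on an
x86-64 host.

#include <errno.h>
#include <seccomp.h>

int main(void)
{
	/* Native (x86_64) filter: everything not explicitly allowed fails with EPERM. */
	scmp_filter_ctx native = seccomp_init(SCMP_ACT_ERRNO(EPERM));
	/* Second filter for the 32-bit ABIs; it must not share an ABI with 'native'. */
	scmp_filter_ctx compat = seccomp_init(SCMP_ACT_ERRNO(EPERM));

	if (!native || !compat)
		return 1;

	/* 'compat' also starts with the native arch, so swap it for x86 and x32. */
	if (seccomp_arch_remove(compat, SCMP_ARCH_NATIVE) < 0 ||
	    seccomp_arch_add(compat, SCMP_ARCH_X86) < 0 ||
	    seccomp_arch_add(compat, SCMP_ARCH_X32) < 0)
		return 1;

	/* Add the (trivial) per-ABI rules with the exact variant. */
	if (seccomp_rule_add_exact(native, SCMP_ACT_ALLOW, SCMP_SYS(getpid), 0) < 0 ||
	    seccomp_rule_add_exact(compat, SCMP_ACT_ALLOW, SCMP_SYS(getpid), 0) < 0)
		return 1;

	/* Merge 'compat' into 'native'; on success the source context is released. */
	if (seccomp_merge(native, compat) < 0)
		return 1;

	/* One load, so the kernel sees a single filter program. */
	if (seccomp_load(native) < 0)
		return 1;

	seccomp_release(native);
	return 0;
}

(Whether a single merged program beats several small ones will still depend
on how the rules are laid out, but it avoids paying the per-filter entry
cost once per installed filter.)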

If you try it and run into any problems let Tom (CC'd) or I know and
we will try to help you guys out.

> > > If we wanted to optimize that in userspace, then libseccomp would have
> > > to be improved quite substantially to let us know exactly what works
> > > and what doesn't, and to have sane fallback both when building
> > > whitelists and blacklists.
> >
> > It has been quite a while since we last talked about systemd's use of
> > libseccomp, but the upcoming v2.5.0 release (no date set yet, but
> > think weeks not months) finally takes a first step towards defining
> > proper return values on error for the API, no more "negative values on
> > error".  I'm sure there are other things, but I recall this as being
> > one of the bigger systemd wants.
>
> Yes, we care about error codes a lot.

The lack of any real stability in error codes was my fault, we really
should have done it sooner but I felt there were bigger issues to
focus on at the time.  Regardless, starting with v2.5.0 we will have
stable error codes that are documented in the manpages; the number of
errors, and the associated documentation, may be relatively limited to
start with, but it should improve over time as we get feedback.

> > As an aside, it is always going to be difficult to allow fine grained
> > control when you have a single libseccomp filter that includes
> > multiple ABIs; the different ABI oddities are just too great (see
> > comments above).  If you need exacting control of the filter, or ABI
> > specific handling, then the recommended way is to create those filters
> > independently and merge them together before loading them into the
> > kernel or applying any common rules.
>
> Hmm, so not sure I got this. But are you saying that when using
> seccomp_merge() am I supposed to merge filters for different archs
> into one megafilter, or are you saying the opposite: am I supposed not
> to do that?

Okay, let me try this again ...

In the context of this discussion it looks like limiting the number of
seccomp filters loaded into the kernel is desirable and one way to do
that with libseccomp is to use the seccomp_merge() API to consolidate
multiple per-ABI filters into one "megafilter" (I like that term)
which can be loaded into the kernel.  Does that make more sense?

> I.e. in an ideal world, should we come to a situation where per
> service on x86-64 we will have exactly one filter installed, or should
> we come to a situation where we'll have exactly three installed, once
> for each ABI?

Perhaps others will clarify, but from my reading of this thread there
is a performance advantage to be gained by limiting the number of
seccomp filters installed for a given process.

--
paul moore
www.paul-moore.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 答复: 答复: new seccomp mode aims to improve performance
  2020-06-02 11:34               ` zhujianwei (C)
@ 2020-06-02 18:32                 ` Kees Cook
  2020-06-03  4:51                   ` 答复: " zhujianwei (C)
  0 siblings, 1 reply; 24+ messages in thread
From: Kees Cook @ 2020-06-02 18:32 UTC (permalink / raw)
  To: zhujianwei (C)
  Cc: Alexei Starovoitov, bpf, linux-security-module, Hehuazhen,
	Lennart Poettering, Christian Ehrhardt,
	Zbigniew Jędrzejewski-Szmek

On Tue, Jun 02, 2020 at 11:34:04AM +0000, zhujianwei (C) wrote:
> And in many scenarios the requirement for syscall filtering is usually
> simple and does not need complex filter rules; for example, just
> configuring a syscall blacklist or whitelist. However, we have noticed
> that seccomp has a performance overhead that cannot be ignored even in
> this simple scenario. For example, referring to Kees's test data, this
> cost is about 41/636 = 6.5%, and Alexei's data gives 17/226 = 7.5%,
> both based on a single filter rule (getpid); our data for this
> overhead is 19.8% (refer to the previous 'original' test results),
> filtering with our 20 rules (unixbench syscall).

I wonder if aarch64 has higher overhead for calling into the TIF_WORK
trace stuff? (Or if aarch64's BPF JIT is not as efficient as x86?)

> // kernel modification
> --- linux-5.7-rc7_1/arch/arm64/kernel/ptrace.c	2020-05-25 06:32:54.000000000 +0800
> +++ linux-5.7-rc7/arch/arm64/kernel/ptrace.c	2020-06-02 12:35:04.412000000 +0800
> @@ -1827,6 +1827,46 @@
>  	regs->regs[regno] = saved_reg;
>  }
>  
> +#define PID_MAX    1000000
> +#define SYSNUM_MAX 0x220

You can use NR_syscalls here, I think.

> +
> +/* all zero*/
> +bool g_light_filter_switch[PID_MAX] = {0};
> +bool g_light_filter_bitmap[PID_MAX][SYSNUM_MAX] = {0};

These can be static, and I would double-check your allocation size -- I
suspect this is allocating a byte for each bool. I would recommend
DECLARE_BITMAP() and friends.
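
For reference, a rough sketch of that suggestion, with made-up names and
none of the locking, lifetime or compat-arch handling a real patch would
need; it only illustrates the DECLARE_BITMAP() + test_bit() shape on top of
NR_syscalls:

#include <linux/bitmap.h>
#include <asm/unistd.h>		/* NR_syscalls on arm64 */

/* Hypothetical per-task structure instead of the global arrays above. */
struct light_filter {
	DECLARE_BITMAP(denied, NR_syscalls);	/* one bit per syscall, not one bool */
};

/* Returns 0 to allow, -1 to deny, mirroring the PoC above. */
static int light_syscall_filter(const struct light_filter *f, int nr)
{
	if (nr < 0 || nr >= NR_syscalls)
		return -1;			/* out of range: deny */
	return test_bit(nr, f->denied) ? -1 : 0;
}

/* Denying a syscall is then a single set_bit(). */
static void light_filter_deny(struct light_filter *f, int nr)
{
	if (nr >= 0 && nr < NR_syscalls)
		set_bit(nr, f->denied);
}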

> +static int __light_syscall_filter(void) {
> +   int pid;
> +	int this_syscall;
> +
> +   pid = current->pid;
> +	this_syscall = syscall_get_nr(current, task_pt_regs(current));
> +
> +   if(g_light_filter_bitmap[pid][this_syscall] == true) {
> +       printk(KERN_ERR "light syscall filter: syscall num %d denied.\n", this_syscall);
> +		goto skip;
> +   }
> +
> +	return 0;
> +skip:	
> +	return -1;
> +}
> +
> +static inline int light_syscall_filter(void) {
> +	if (unlikely(test_thread_flag(TIF_SECCOMP))) {
> +                 return __light_syscall_filter();
> +        }
> +
> +	return 0;
> +}
> +
>  int syscall_trace_enter(struct pt_regs *regs)
>  {
>  	unsigned long flags = READ_ONCE(current_thread_info()->flags);
> @@ -1837,9 +1877,10 @@
>  			return -1;
>  	}
>  
> -	/* Do the secure computing after ptrace; failures should be fast. */
> -	if (secure_computing() == -1)
> +	/* light check for syscall-num-only rule. */
> +	if (light_syscall_filter() == -1) {
>  		return -1;
> +	}
>  
>  	if (test_thread_flag(TIF_SYSCALL_TRACEPOINT))
>  		trace_sys_enter(regs, regs->syscallno);

Given that you're still doing this in syscall_trace_enter(), I imagine
it could live in secure_computing().

Anyway, the functionality here is similar to what I've been working
on for bitmaps (having a global preallocated bitmap isn't going to be
upstreamable, but it's good for PoC). The complications are with handling
differing architecture (for compat systems), tracking/choosing between
the various basic SECCOMP_RET_* behaviors, etc.

-Kees

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: new seccomp mode aims to improve performance
  2020-06-02 12:44           ` Lennart Poettering
@ 2020-06-02 18:37             ` Kees Cook
  2020-06-16  6:00             ` Kees Cook
  1 sibling, 0 replies; 24+ messages in thread
From: Kees Cook @ 2020-06-02 18:37 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: Alexei Starovoitov, zhujianwei (C),
	bpf, linux-security-module, Hehuazhen, Christian Ehrhardt,
	Zbigniew Jędrzejewski-Szmek

On Tue, Jun 02, 2020 at 02:44:31PM +0200, Lennart Poettering wrote:
> On Mo, 01.06.20 11:21, Kees Cook (keescook@chromium.org) wrote:
> > Would it make sense to provide a systemd setting for services to declare
> > "no compat" or "no x32" (I'm not sure what to call this mode more
> > generically, "no 32-bit allocation ABI"?) Then you can just install
> > a single merged filter for all the native syscalls that starts with
> > "if not native, reject"?
> 
> We have that actually, it's this line you pasted above:
> 
>         SystemCallArchitectures=native
> 
> It means: block all syscall ABIs but the native one for all processes
> of this service.
> 
> We currently use that setting only to synthesize an explicit seccomp
> filter masking the other ABIs wholesale. We do not use it to suppress
> generation of other, unrelated seccomp filters for that
> arch. i.e. which means you might end up with one filter blocking x32
> wholesale, but then another unrelated option might install a filter
> blocking some specific syscall with some specific arguments, but still
> gets installed for x86-64 *and* i386 *and* x32. I guess we could
> relatively easily tweak that and suppress the latter. If we did, then
> on all services that set SystemCallArchitectures=native on x86-64 the
> number of installed seccomp filters should become a third.

Right, that's what I meant -- on x86_64 we've got way too many filters
installed if we only care about "native" arch. ;)
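
For concreteness, a sketch of what such a "not native, reject" prefix looks
like in classic BPF on x86-64 (illustrative only, not the filter systemd
generates; note that x32 reports AUDIT_ARCH_X86_64 and has to be caught via
the syscall number instead):

#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

struct sock_filter native_only_prefix[] = {
	/* Load seccomp_data->arch. */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
	/* Any non-native ABI (e.g. i386) is rejected outright. */
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
	/* x32 also reports AUDIT_ARCH_X86_64 but sets bit 30 in the syscall nr. */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
	BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, 0x40000000 /* __X32_SYSCALL_BIT */, 0, 1),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
	/* Native x86-64: fall through to the real per-syscall rules (here: allow). */
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

The real per-syscall rules would replace the final RET_ALLOW, so the ABI
check runs once at the top of a single merged filter.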

> > (Or better yet: make the default for filtering be "native only", and
> > let services opt into other ABIs?)
> 
> That sounds like it would make people quite unhappy no? given that on
> a systemd system anything that runs in userspace is ultimately part of
> a service managed by systemd, if we'd default to "no native ABIs" this
> would translate to "yeah, we entirely disable the i386 ABI for the
> entire system unless you reconfigure it and/or opt-out your old i386
> services".
> 
> Hence, on x86-64, I figure just masking i386 entirely is a bit too
> drastic a compat breakage for us, no? Masking x32 otoh sounds like a
> safe default to do without breaking too much compat given that x32 is
> on its way out.

Well, I meant "if seccomp filters get generated, default to native ABI".
Right now, it seems most things running from systemd with seccomp
filters are daemons, not user processes? (e.g. ssh.service,
getty@.service, etc. have no filtering attached.)

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: new seccomp mode aims to improve performance
  2020-06-02 15:03             ` Paul Moore
@ 2020-06-02 18:39               ` Kees Cook
  0 siblings, 0 replies; 24+ messages in thread
From: Kees Cook @ 2020-06-02 18:39 UTC (permalink / raw)
  To: Paul Moore
  Cc: Lennart Poettering, Alexei Starovoitov, zhujianwei (C),
	bpf, linux-security-module, Hehuazhen, Christian Ehrhardt,
	Zbigniew Jędrzejewski-Szmek, Tom Hromatka

On Tue, Jun 02, 2020 at 11:03:31AM -0400, Paul Moore wrote:
> Perhaps others will clarify, but from my reading of this thread there
> is a performance advantage to be gained by limiting the number of
> seccomp filters installed for a given process.

Generally speaking, yes, though obviously the size and layout of a single
filter (i.e. is it a balanced tree?) will still impact the overhead.
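
As an aside, and assuming a libseccomp new enough to provide the
SCMP_FLTATR_CTL_OPTIMIZE attribute (it is not in the releases available at
the time of this thread), callers can ask for a binary-tree-ordered program
rather than a linear run of syscall comparisons; a sketch, not a
recommendation:

#include <errno.h>
#include <seccomp.h>

static scmp_filter_ctx make_tree_filter(void)
{
	scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ERRNO(EPERM));

	/* Value 2 requests binary-tree ordering of the generated filter. */
	if (ctx && seccomp_attr_set(ctx, SCMP_FLTATR_CTL_OPTIMIZE, 2) < 0) {
		seccomp_release(ctx);	/* attribute not supported by this libseccomp */
		return NULL;
	}
	return ctx;
}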

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 24+ messages in thread

* 答复: 答复: 答复: new seccomp mode aims to improve performance
  2020-06-02 18:32                 ` Kees Cook
@ 2020-06-03  4:51                   ` zhujianwei (C)
  0 siblings, 0 replies; 24+ messages in thread
From: zhujianwei (C) @ 2020-06-03  4:51 UTC (permalink / raw)
  To: Kees Cook
  Cc: Alexei Starovoitov, bpf, linux-security-module, Hehuazhen,
	Lennart Poettering, Christian Ehrhardt,
	Zbigniew Jędrzejewski-Szmek

> Given that you're still doing this in syscall_trace_enter(), I imagine
> it could live in secure_computing().

Indeed, we agree with adding light_syscall_filter() in secure_computing().

> I wonder if aarch64 has higher overhead for calling into the TIF_WORK
> trace stuff? (Or if aarch64's BPF JIT is not as efficient as x86?)

We also suspect there are several possible reasons.
And we think the further forward the bitmap filter is placed, the better. Our test results show that placing the bitmap filter early removes most of the single-filter overhead. What is your opinion on that?

> Anyway, the functionality here is similar to what I've been working
> on for bitmaps (having a global preallocated bitmap isn't going to be
> upstreamable, but it's good for PoC). The complications are with handling
> differing architecture (for compat systems), tracking/choosing between
> the various basic SECCOMP_RET_* behaviors, etc.

Firstly, thank you for the corrections to the code; yes, it is just a PoC for the performance test.
Indeed, your bitmap idea is basically the same as ours. And we are trying to find a solution that improves seccomp performance for production use.
So what is your plan for getting it done?
Could we do something to help you move it forward?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: new seccomp mode aims to improve performance
  2020-06-02 12:44           ` Lennart Poettering
  2020-06-02 18:37             ` Kees Cook
@ 2020-06-16  6:00             ` Kees Cook
  1 sibling, 0 replies; 24+ messages in thread
From: Kees Cook @ 2020-06-16  6:00 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: Alexei Starovoitov, zhujianwei (C),
	bpf, linux-security-module, Hehuazhen, Christian Ehrhardt,
	Zbigniew Jędrzejewski-Szmek

On Tue, Jun 02, 2020 at 02:44:31PM +0200, Lennart Poettering wrote:
> We have that actually, it's this line you pasted above:
> 
>         SystemCallArchitectures=native
> 
> It means: block all syscall ABIs but the native one for all processes
> of this service.

Gotcha. And I see this now as I'm working on the code to generate
bitmaps automatically.  After systemd-resolved applies the 26th filter,
the "compat" bitmap goes from effectively a duplicate of the native
syscall map to blocking everything.

after filter 25:
...
[    5.405296] seccomp: syscall bitmap: compat     0-4: SECCOMP_RET_ALLOW
[    5.405297] seccomp: syscall bitmap: compat       5: filter
...
[    5.405326] seccomp: syscall bitmap: compat     380: filter
[    5.405327] seccomp: syscall bitmap: compat 381-439: SECCOMP_RET_ALLOW

after filter 26:
...
[    5.405498] seccomp: syscall bitmap: compat   0-439: SECCOMP_RET_KILL_THREAD

So that seems to be working as expected. :)

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2020-06-16  6:01 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-29 12:48 new seccomp mode aims to improve performance zhujianwei (C)
2020-05-29 15:43 ` Alexei Starovoitov
2020-05-29 16:09   ` Kees Cook
2020-05-29 17:31     ` Alexei Starovoitov
2020-05-29 19:27     ` Kees Cook
2020-05-31 17:19       ` Alexei Starovoitov
2020-06-01 18:16         ` Kees Cook
2020-06-01  2:08       ` 答复: " zhujianwei (C)
2020-06-01  3:30         ` Alexei Starovoitov
2020-06-02  2:42           ` 答复: " zhujianwei (C)
2020-06-02  3:24             ` Alexei Starovoitov
2020-06-02 11:13               ` 答复: " zhujianwei (C)
2020-06-02 11:34               ` zhujianwei (C)
2020-06-02 18:32                 ` Kees Cook
2020-06-03  4:51                   ` 答复: " zhujianwei (C)
2020-06-01 10:11       ` Lennart Poettering
2020-06-01 12:32         ` Paul Moore
2020-06-02 12:53           ` Lennart Poettering
2020-06-02 15:03             ` Paul Moore
2020-06-02 18:39               ` Kees Cook
2020-06-01 18:21         ` Kees Cook
2020-06-02 12:44           ` Lennart Poettering
2020-06-02 18:37             ` Kees Cook
2020-06-16  6:00             ` Kees Cook
