* new seccomp mode aims to improve performance
@ 2020-05-29 12:48 zhujianwei (C)
  2020-05-29 15:43 ` Alexei Starovoitov
  0 siblings, 1 reply; 24+ messages in thread

From: zhujianwei (C) @ 2020-05-29 12:48 UTC (permalink / raw)
To: bpf, linux-security-module; +Cc: Hehuazhen

Hi, all

We're using seccomp to increase container security, but the BPF filter rules cause performance to deteriorate. Is there a good way to improve performance, or could we add a simplified seccomp mode to improve it?

// Pseudo code
int __secure_computing(int this_syscall)
{
	...
	switch (mode) {
	case SECCOMP_MODE_STRICT:
		...
	case SECCOMP_MODE_FILTER:
		...
	case SECCOMP_MODE_LIGHT_FILTER:
		/* do light syscall filter. */
		...
		break;
	}
	...
}

int light_syscall_filter(int syscall_num)
{
	if (syscall_num > SYSNUM_MAX) {
		...
		return -EACCES;
	}

	bool *filter_map = get_filter_map(current);
	if (filter_map == NULL) {
		...
		return -EFAULT;
	}

	if (filter_map[syscall_num] == true) {
		...
		return 0;
	} else {
		...
		return -EACCES;
	}
	...
}

^ permalink raw reply	[flat|nested] 24+ messages in thread
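[Editor's note: the proposed light_syscall_filter() boils down to a per-task O(1) array lookup. A minimal userspace sketch of that idea follows; SYSNUM_MAX, the map layout, and the function name are illustrative assumptions, not kernel API.]

```c
#include <errno.h>
#include <stdbool.h>

#define SYSNUM_MAX 450  /* hypothetical upper bound on syscall numbers */

/* One flag per syscall number: true = allowed. Stands in for the
 * per-task map that get_filter_map(current) would return. */
static bool filter_map[SYSNUM_MAX + 1];

/* O(1) allow/deny decision mirroring the proposed light_syscall_filter(). */
static int light_filter_lookup(int syscall_num)
{
	if (syscall_num < 0 || syscall_num > SYSNUM_MAX)
		return -EACCES;  /* out-of-range syscalls are denied */
	return filter_map[syscall_num] ? 0 : -EACCES;
}
```

Whatever the mode ends up being called, the per-syscall cost is one bounds check plus one array load, which is the property the proposal is after.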
* Re: new seccomp mode aims to improve performance
  2020-05-29 12:48 new seccomp mode aims to improve performance zhujianwei (C)
@ 2020-05-29 15:43 ` Alexei Starovoitov
  2020-05-29 16:09 ` Kees Cook
  0 siblings, 1 reply; 24+ messages in thread

From: Alexei Starovoitov @ 2020-05-29 15:43 UTC (permalink / raw)
To: zhujianwei (C); +Cc: bpf, linux-security-module, Hehuazhen, Kees Cook

On Fri, May 29, 2020 at 5:50 AM zhujianwei (C) <zhujianwei7@huawei.com> wrote:
>
> Hi, all
>
> We're using seccomp to increase container security, but bpf rules filter causes performance to deteriorate. So, is there a good solution to improve performance, or can we add a simplified seccomp mode to improve performance?

I don't think your hunch at where the CPU is spending cycles is correct.
Could you please do two experiments:
1. try a trivial seccomp bpf prog that simply returns 'allow'
2. replace bpf_prog_run_pin_on_cpu() in seccomp.c with C code that returns
   'allow', and make sure it's noinline or in a different C file, so that
   the compiler doesn't optimize the whole seccomp_run_filters() into a nop.

Then measure the performance of both. I bet you'll see exactly the same
numbers. If you have retpoline on, the bpf case will be slightly slower
because of retpoline cost.

Only after this experiment let's discuss the options for accelerating
seccomp.

> [pseudocode trimmed]
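[Editor's note: experiment 1 above — a trivial filter that just returns 'allow' — can be installed from userspace roughly as follows. This is a sketch of the standard prctl-based install path with error handling trimmed; it is not the benchmark harness either party used.]

```c
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <unistd.h>

/* Install a one-instruction classic-BPF filter that allows every syscall,
 * so any measured slowdown is pure seccomp/BPF entry overhead. */
static int install_allow_all_filter(void)
{
	struct sock_filter allow_all[] = {
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	struct sock_fprog prog = {
		.len = sizeof(allow_all) / sizeof(allow_all[0]),
		.filter = allow_all,
	};

	/* Required so an unprivileged process may install a filter. */
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
		return -1;
	return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

A benchmark would call this once and then time a cheap syscall such as getpid() in a loop, comparing against the same loop run before the filter is attached.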
* Re: new seccomp mode aims to improve performance
  2020-05-29 15:43 ` Alexei Starovoitov
@ 2020-05-29 16:09 ` Kees Cook
  2020-05-29 17:31 ` Alexei Starovoitov
  2020-05-29 19:27 ` Kees Cook
  0 siblings, 2 replies; 24+ messages in thread

From: Kees Cook @ 2020-05-29 16:09 UTC (permalink / raw)
To: Alexei Starovoitov; +Cc: zhujianwei (C), bpf, linux-security-module, Hehuazhen

On Fri, May 29, 2020 at 08:43:56AM -0700, Alexei Starovoitov wrote:
> On Fri, May 29, 2020 at 5:50 AM zhujianwei (C) <zhujianwei7@huawei.com> wrote:
> >
> > We're using seccomp to increase container security, but bpf rules filter causes performance to deteriorate. So, is there a good solution to improve performance, or can we add a simplified seccomp mode to improve performance?

Yes, there are already plans for a simple syscall bitmap[1] seccomp feature.

> I don't think your hunch at where cpu is spending cycles is correct.
> Could you please do two experiments:
> 1. try trivial seccomp bpf prog that simply returns 'allow'
> 2. replace bpf_prog_run_pin_on_cpu() in seccomp.c with C code
>    that returns 'allow' and make sure it's noinline or in a different C file,
>    so that compiler doesn't optimize the whole seccomp_run_filters() into a nop.
>
> Then measure performance of both.
> I bet you'll see exactly the same numbers.

Android has already done this, and it appeared not to be the same: calling
into a SECCOMP_RET_ALLOW filter had a surprisingly high cost. I'll see if
I can get you the numbers. I was frankly quite surprised -- I understood
the bulk of the seccomp overhead to be in taking the TIF_WORK path,
copying arguments, etc., but something else is going on there.

-Kees

[1] https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/

--
Kees Cook
* Re: new seccomp mode aims to improve performance
  2020-05-29 16:09 ` Kees Cook
@ 2020-05-29 17:31 ` Alexei Starovoitov
  2020-05-29 19:27 ` Kees Cook
  1 sibling, 0 replies; 24+ messages in thread

From: Alexei Starovoitov @ 2020-05-29 17:31 UTC (permalink / raw)
To: Kees Cook; +Cc: zhujianwei (C), bpf, linux-security-module, Hehuazhen

On Fri, May 29, 2020 at 9:09 AM Kees Cook <keescook@chromium.org> wrote:
> On Fri, May 29, 2020 at 08:43:56AM -0700, Alexei Starovoitov wrote:
> > [experiment suggestions trimmed]
>
> Android has already done this, it appeared to not be the same. Calling
> into a SECCOMP_RET_ALLOW filter had a surprisingly high cost. I'll see
> if I can get you the numbers. I was frankly quite surprised -- I
> understood the bulk of the seccomp overhead to be in taking the TIF_WORK
> path, copying arguments, etc, but something else is going on there.

Kees,

Please show the numbers that prove your point. I've seen people make this
mistake over and over again. Intel folks also said that calling into bpf
is slow, only to be proved wrong. It turned out to be the cost of
retpoline, and the bpf_dispatcher logic resolved it.
* Re: new seccomp mode aims to improve performance
  2020-05-29 16:09 ` Kees Cook
  2020-05-29 17:31 ` Alexei Starovoitov
@ 2020-05-29 19:27 ` Kees Cook
  2020-05-31 17:19 ` Alexei Starovoitov
  ` (2 more replies)
  1 sibling, 3 replies; 24+ messages in thread

From: Kees Cook @ 2020-05-29 19:27 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: zhujianwei (C), bpf, linux-security-module, Hehuazhen, Lennart Poettering, Christian Ehrhardt, Zbigniew Jędrzejewski-Szmek

On Fri, May 29, 2020 at 09:09:28AM -0700, Kees Cook wrote:
> On Fri, May 29, 2020 at 08:43:56AM -0700, Alexei Starovoitov wrote:
> > I don't think your hunch at where cpu is spending cycles is correct.
> > Could you please do two experiments:
> > 1. try trivial seccomp bpf prog that simply returns 'allow'
> > 2. replace bpf_prog_run_pin_on_cpu() in seccomp.c with C code
> >    that returns 'allow' and make sure it's noinline or in a different C file,
> >    so that compiler doesn't optimize the whole seccomp_run_filters() into a nop.
> >
> > Then measure performance of both.
> > I bet you'll see exactly the same numbers.
>
> Android has already done this, it appeared to not be the same. Calling
> into a SECCOMP_RET_ALLOW filter had a surprisingly high cost. [...]

So while it's not the Android measurements, here's what I'm seeing on
x86_64 (this is hardly a perfect noiseless benchmark, but sampling error
appears to be close to 1%):

net.core.bpf_jit_enable=0:

Benchmarking 16777216 samples...
10.633756139 - 0.004359714 = 10629396425
getpid native: 633 ns
23.008737499 - 10.633967641 = 12374769858
getpid RET_ALLOW 1 filter: 737 ns
36.723141843 - 23.008975696 = 13714166147
getpid RET_ALLOW 2 filters: 817 ns
47.751422021 - 36.723345630 = 11028076391
getpid BPF-less allow: 657 ns
Estimated total seccomp overhead for 1 filter: 104 ns
Estimated total seccomp overhead for 2 filters: 184 ns
Estimated seccomp per-filter overhead: 80 ns
Estimated seccomp entry overhead: 24 ns
Estimated BPF overhead per filter: 80 ns

net.core.bpf_jit_enable=1:
net.core.bpf_jit_harden=1:

Benchmarking 16777216 samples...
31.939978606 - 21.275190689 = 10664787917
getpid native: 635 ns
43.324592380 - 31.940794751 = 11383797629
getpid RET_ALLOW 1 filter: 678 ns
55.001650599 - 43.326293248 = 11675357351
getpid RET_ALLOW 2 filters: 695 ns
65.986452855 - 55.002249904 = 10984202951
getpid BPF-less allow: 654 ns
Estimated total seccomp overhead for 1 filter: 43 ns
Estimated total seccomp overhead for 2 filters: 60 ns
Estimated seccomp per-filter overhead: 17 ns
Estimated seccomp entry overhead: 26 ns
Estimated BPF overhead per filter: 24 ns

net.core.bpf_jit_enable=1:
net.core.bpf_jit_harden=0:

Benchmarking 16777216 samples...
10.684681435 - 0.004198682 = 10680482753
getpid native: 636 ns
22.050823167 - 10.685571417 = 11365251750
getpid RET_ALLOW 1 filter: 677 ns
33.714134291 - 22.051100183 = 11663034108
getpid RET_ALLOW 2 filters: 695 ns
44.793312551 - 33.714383001 = 11078929550
getpid BPF-less allow: 660 ns
Estimated total seccomp overhead for 1 filter: 41 ns
Estimated total seccomp overhead for 2 filters: 59 ns
Estimated seccomp per-filter overhead: 18 ns
Estimated seccomp entry overhead: 23 ns
Estimated BPF overhead per filter: 17 ns

The above is from my (very dangerous!) benchmarking patch[1].

So, with the layered nature of seccomp filters, there's a reasonable gain
to be seen from an O(1) bitmap lookup that skips running even a single
filter, even in the fastest BPF mode.

Not that we need to optimize for the pathological case, but this would be
especially useful for cases like systemd, which appears to be constructing
seccomp filters very inefficiently, maybe on a per-syscall[3] basis? For
example, systemd-resolved has 32 (!) seccomp filters attached[2]:

# grep ^Seccomp_filters /proc/$(pidof systemd-resolved)/status
Seccomp_filters: 32

# grep SystemCall /lib/systemd/system/systemd-resolved.service
SystemCallArchitectures=native
SystemCallErrorNumber=EPERM
SystemCallFilter=@system-service

I'd like to better understand what they're doing, but haven't had time to
dig in. (The systemd devel mailing list requires subscription, so I've
directly CCed some systemd folks that have touched seccomp there recently.
Hi! The start of this thread is here[4].)

-Kees

[1] https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git/commit/?h=seccomp/benchmark-bpf&id=20cc7d8f4238ea3bc1798f204bb865f4994cca27
[2] https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git/commit/?h=for-next/seccomp&id=9d06f16f463cef5c445af9738efed2bfe4c64730
[3] https://www.freedesktop.org/software/systemd/man/systemd.exec.html#SystemCallFilter=
[4] https://lore.kernel.org/bpf/c22a6c3cefc2412cad00ae14c1371711@huawei.com/

--
Kees Cook
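[Editor's note: Kees's estimates decompose per-syscall seccomp cost into a fixed entry cost plus a per-filter cost. Taking the JITed, non-hardened run above (~23 ns entry, ~18 ns per filter), the expected overhead for N stacked filters follows a simple linear model — a rough sketch from these particular measurements, not a general law:]

```c
/* Rough linear model of seccomp overhead for n stacked filters,
 * using the bpf_jit_enable=1 / bpf_jit_harden=0 estimates above. */
static long seccomp_overhead_ns(int nfilters)
{
	const long entry_ns = 23;       /* estimated seccomp entry overhead */
	const long per_filter_ns = 18;  /* estimated per-filter overhead */

	return entry_ns + (long)nfilters * per_filter_ns;
}
```

The model reproduces the measured totals (41 ns for one filter, 59 ns for two) and suggests why a process with 32 attached filters, as below, pays far more than a process with one.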
* Re: new seccomp mode aims to improve performance
  2020-05-29 19:27 ` Kees Cook
@ 2020-05-31 17:19 ` Alexei Starovoitov
  2020-06-01 18:16 ` Kees Cook
  0 siblings, 1 reply; 24+ messages in thread

From: Alexei Starovoitov @ 2020-05-31 17:19 UTC (permalink / raw)
To: Kees Cook
Cc: zhujianwei (C), bpf, linux-security-module, Hehuazhen, Lennart Poettering, Christian Ehrhardt, Zbigniew Jędrzejewski-Szmek, daniel, netdev

On Fri, May 29, 2020 at 12:27:03PM -0700, Kees Cook wrote:
> So while it's not the Android measurements, here's what I'm seeing on
> x86_64 (this is hardly a perfect noiseless benchmark, but sampling error
> appears to be close to 1%):
>
> [benchmark results trimmed]
>
> The above is from my (very dangerous!) benchmarking patch[1].

Thank you for crafting a benchmark. The only problem is that it's not
doing a fair comparison. The trouble with that patch [1] is that it uses:

static noinline u32 __seccomp_benchmark(struct bpf_prog *prog,
                                        const struct seccomp_data *sd)
{
        return SECCOMP_RET_ALLOW;
}

as the benchmarking function. The 'noinline' keyword tells the compiler to
keep the body of the function, but the compiler still does full control and
data flow analysis through this function, and it is smart enough to
optimize its usage in seccomp_run_filters() and in __seccomp_filter(),
because all the functions are in a single .c file. Lots of code gets
optimized away when 'f->benchmark' is on.

To make it a fair comparison, I've added the following patch on top of
your [1]:

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 2fdbf5ad8372..86204422e096 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -244,7 +244,7 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen)
        return 0;
 }

-static noinline u32 __seccomp_benchmark(struct bpf_prog *prog,
+__weak noinline u32 __seccomp_benchmark(struct bpf_prog *prog,
                                        const struct seccomp_data *sd)

Please take a look at 'make kernel/seccomp.s' before and after to see the
difference the __weak keyword makes.

And here is what seccomp_benchmark now reports:

Benchmarking 33554432 samples...
22.618269641 - 15.030812794 = 7587456847
getpid native: 226 ns
30.792042986 - 22.619048831 = 8172994155
getpid RET_ALLOW 1 filter: 243 ns
39.451435038 - 30.792836778 = 8658598260
getpid RET_ALLOW 2 filters: 258 ns
47.616011529 - 39.452190830 = 8163820699
getpid BPF-less allow: 243 ns
Estimated total seccomp overhead for 1 filter: 17 ns
Estimated total seccomp overhead for 2 filters: 32 ns
Estimated seccomp per-filter overhead: 15 ns
Estimated seccomp entry overhead: 2 ns
Estimated BPF overhead per filter: 0 ns

Depending on the run, BPF-less mode can even be slower than with BPF ;)

Benchmarking 33554432 samples...
22.602737193 - 15.078827612 = 7523909581
getpid native: 224 ns
30.734009056 - 22.603540911 = 8130468145
getpid RET_ALLOW 1 filter: 242 ns
39.106701659 - 30.734762631 = 8371939028
getpid RET_ALLOW 2 filters: 249 ns
47.326509567 - 39.107552786 = 8218956781
getpid BPF-less allow: 244 ns
Estimated total seccomp overhead for 1 filter: 18 ns
Estimated total seccomp overhead for 2 filters: 25 ns
Estimated seccomp per-filter overhead: 7 ns
Estimated seccomp entry overhead: 11 ns
Estimated BPF overhead per filter: 18446744073709551614 ns

The above numbers were obtained on a non-debug kernel with retpoline off
and net.core.bpf_jit_enable=1. When retpoline=y, the f->benchmark mode
will be slightly faster due to retpoline overhead. If retpoline is a
concern, the bpf dispatcher logic can be applied to seccomp. It eliminated
the retpoline overhead in the XDP/bpf fast path.

> So, with the layered nature of seccomp filters there's a reasonable gain
> to be seen for a O(1) bitmap lookup to skip running even a single filter,
> even for the fastest BPF mode.

This is not true. An O(1) bitmap implemented as kernel C code will have
exactly the same speed as an O(1) bitmap implemented as an eBPF program.

> Not that we need to optimize for the pathological case, but this would
> be especially useful for cases like systemd, which appears to be
> constructing seccomp filters very inefficiently maybe on a per-syscall[3]
> basis? For example, systemd-resolved has 32 (!) seccomp filters
> attached[2].

32 seccomp filters sounds like a lot. It would be great to find out what
they are doing and whether they can be optimized into a much shorter and
faster eBPF program.
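[Editor's note: one way the 32 stacked filters could collapse into a single short program is a flat allowlist over the syscall number. A sketch of the classic-BPF shape of such a filter follows — illustrative only, not systemd's actual policy; a real filter must also verify seccomp_data->arch before trusting the syscall number, which is omitted here.]

```c
#include <errno.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>

/* Build one allowlist filter: a syscall-number load, one jump per allowed
 * syscall, then a default "return EPERM" (matching SystemCallErrorNumber=
 * EPERM) -- O(n) instructions in ONE program instead of n stacked programs.
 * Valid for small n only, since cBPF jump offsets are 8-bit. */
static size_t build_allowlist(struct sock_filter *out, const int *nrs, size_t n)
{
	size_t i = 0, k;

	/* Load the syscall number from seccomp_data->nr (offset 0). */
	out[i++] = (struct sock_filter)BPF_STMT(BPF_LD | BPF_W | BPF_ABS, 0);
	for (k = 0; k < n; k++)	/* jump to the final ALLOW if nr matches */
		out[i++] = (struct sock_filter)BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
				nrs[k], (unsigned char)(n - k), 0);
	out[i++] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K,
				SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA));
	out[i++] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW);
	return i;	/* number of instructions emitted: n + 3 */
}
```

Attached once via prctl() or seccomp(2), this runs every syscall through a single program walk rather than 32 separate filter invocations.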
* Re: new seccomp mode aims to improve performance
  2020-05-31 17:19 ` Alexei Starovoitov
@ 2020-06-01 18:16 ` Kees Cook
  0 siblings, 0 replies; 24+ messages in thread

From: Kees Cook @ 2020-06-01 18:16 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: zhujianwei (C), bpf, linux-security-module, Hehuazhen, Lennart Poettering, Christian Ehrhardt, Zbigniew Jędrzejewski-Szmek, daniel, netdev

On Sun, May 31, 2020 at 10:19:15AM -0700, Alexei Starovoitov wrote:
> Thank you for crafting a benchmark.
> The only problem is that it's not doing a fair comparison.
> [...]
>
> -static noinline u32 __seccomp_benchmark(struct bpf_prog *prog,
> +__weak noinline u32 __seccomp_benchmark(struct bpf_prog *prog,
>                                         const struct seccomp_data *sd)
>
> Please take a look at 'make kernel/seccomp.s' before and after to see the
> difference __weak keyword makes.

Ah yeah, thanks. That does bring it up to the same overhead. Nice!

> [seccomp_benchmark results trimmed]
>
> > So, with the layered nature of seccomp filters there's a reasonable gain
> > to be seen for a O(1) bitmap lookup to skip running even a single filter,
> > even for the fastest BPF mode.
>
> This is not true.
> The O(1) bitmap implemented as kernel C code will have exactly the same speed
> as O(1) bitmap implemented as eBPF program.

Yes, that'd be true if it were the first (and only) filter. What I'm
trying to provide is a mechanism to speed up the syscalls for all attached
filters (i.e. create a seccomp fast path). The reality of seccomp usage is
that it's very layered: systemd sets some (or many!), then the container
runtime sets some, then the process itself might set some.

--
Kees Cook
* Re: new seccomp mode aims to improve performance
  2020-05-29 19:27 ` Kees Cook
  2020-05-31 17:19 ` Alexei Starovoitov
@ 2020-06-01 2:08 ` zhujianwei (C)
  2020-06-01 3:30 ` Alexei Starovoitov
  2020-06-01 10:11 ` Lennart Poettering
  2 siblings, 1 reply; 24+ messages in thread

From: zhujianwei (C) @ 2020-06-01 2:08 UTC (permalink / raw)
To: Kees Cook, Alexei Starovoitov
Cc: bpf, linux-security-module, Hehuazhen, Lennart Poettering, Christian Ehrhardt, Zbigniew Jędrzejewski-Szmek

This is the test result on Linux 5.7.0-rc7 for aarch64, with retpoline
disabled by default:

# cat /sys/devices/system/cpu/vulnerabilities/spectre_v2
Not affected

bpf_jit_enable 1
bpf_jit_harden 0

We ran the unixbench syscall benchmark on the original kernel and on the
modified one (bpf_prog_run_pin_on_cpu() replaced with a stub that
immediately returns 'allow'). The unixbench syscall test case runs 5
system calls (close/umask/dup/getpid/getuid; an extra 15 syscalls are
needed to run it) in a loop for 10 seconds, counts the iterations, and
prints the total. We also attached more filters (each with the same rules)
to evaluate the situation Kees mentioned (cases like systemd-resolved),
and confirmed it: more filters, more overhead. The following is our result
(./syscall 10 m):

original:
seccomp_off:           10684939
seccomp_on_1_filters:   8513805  overhead: 19.8%
seccomp_on_4_filters:   7105592  overhead: 33.0%
seccomp_on_32_filters:  2308677  overhead: 78.3%

after replacing bpf_prog_run_pin_on_cpu:
seccomp_off:           10685244
seccomp_on_1_filters:   9146483  overhead: 14.1%
seccomp_on_4_filters:   8969886  overhead: 16.0%
seccomp_on_32_filters:  6454372  overhead: 39.6%

N-filter bpf overhead:
1_filters:  5.7%
4_filters:  17.0%
32_filters: 38.7%

// kernel code modification place
static noinline u32 bpf_prog_run_pin_on_cpu_allow(const struct bpf_prog *prog,
						  const void *ctx)
{
	return SECCOMP_RET_ALLOW;
}

static u32 seccomp_run_filters(const struct seccomp_data *sd,
			       struct seccomp_filter **match)
{
	u32 ret = SECCOMP_RET_ALLOW;
	...
	for (; f; f = f->prev) {
-		u32 cur_ret = bpf_prog_run_pin_on_cpu(f->prog, sd);
+		u32 cur_ret = bpf_prog_run_pin_on_cpu_allow(f->prog, sd);

		if (ACTION_ONLY(cur_ret) < ACTION_ONLY(ret)) {
			ret = cur_ret;
			*match = f;
		}
	}
	return ret;
}

// unixbench syscall testcase with seccomp enabled:
void set_allow_rules(scmp_filter_ctx ctx)
{
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(close), 0);
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(dup), 0);
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(getpid), 0);
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(getuid), 0);
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(umask), 0);
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(seccomp), 0);
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(prctl), 0);
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(faccessat), 0);
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(openat), 0);
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(fstat), 0);
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(setitimer), 0);
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(rt_sigaction), 0);
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(brk), 0);
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(munmap), 0);
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(execve), 0);
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(mmap), 0);
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(mprotect), 0);
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);
}

int main(int argc, char *argv[])
{
	...
	duration = atoi(argv[1]);
	iter = 0;
	wake_me(duration, report);

	scmp_filter_ctx ctx;
	ctx = seccomp_init(SCMP_ACT_KILL);
	set_allow_rules(ctx);

	int cnt;
	for (cnt = 0; cnt < 32; cnt++)
		seccomp_load(ctx);

	while (1) {
		close(dup(0));
		getpid();
		getuid();
		umask(022);
		iter++;
	}
	...
}

-----Original Message-----
From: bpf-owner@vger.kernel.org [mailto:bpf-owner@vger.kernel.org] On Behalf Of Kees Cook
Sent: May 30, 2020, 3:27
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: zhujianwei (C) <zhujianwei7@huawei.com>; bpf@vger.kernel.org; linux-security-module@vger.kernel.org; Hehuazhen <hehuazhen@huawei.com>; Lennart Poettering <lennart@poettering.net>; Christian Ehrhardt <christian.ehrhardt@canonical.com>; Zbigniew Jędrzejewski-Szmek <zbyszek@in.waw.pl>
Subject: Re: new seccomp mode aims to improve performance

[verbatim copy of Kees Cook's benchmark mail trimmed; see the message above]
* Re: new seccomp mode aims to improve performance 2020-06-01 2:08 ` 答复: " zhujianwei (C) @ 2020-06-01 3:30 ` Alexei Starovoitov 2020-06-02 2:42 ` 答复: " zhujianwei (C) 0 siblings, 1 reply; 24+ messages in thread From: Alexei Starovoitov @ 2020-06-01 3:30 UTC (permalink / raw) To: zhujianwei (C) Cc: Kees Cook, bpf, linux-security-module, Hehuazhen, Lennart Poettering, Christian Ehrhardt, Zbigniew Jędrzejewski-Szmek On Sun, May 31, 2020 at 7:08 PM zhujianwei (C) <zhujianwei7@huawei.com> wrote: > > This is the test result on linux 5.7.0-rc7 for aarch64. > And retpoline disabled default. > #cat /sys/devices/system/cpu/vulnerabilities/spectre_v2 > Not affected > > bpf_jit_enable 1 > bpf_jit_harden 0 > > We run unixbench syscall benchmark on the original kernel and the new one(replace bpf_prog_run_pin_on_cpu() with immediately returning 'allow' one). > The unixbench syscall testcase runs 5 system calls(close/umask/dup/getpid/getuid, extra 15 syscalls needed to run it) in a loop for 10 seconds, counts the number and finally output it. We also add some more filters (each with the same rules) to evaluate the situation just like kees mentioned(case like systemd-resolve), and we find it is right: more filters, more overhead. The following is our result (./syscall 10 m): > > original: > seccomp_off: 10684939 > seccomp_on_1_filters: 8513805 overhead:19.8% > seccomp_on_4_filters: 7105592 overhead:33.0% > seccomp_on_32_filters: 2308677 overhead:78.3% > > after replacing bpf_prog_run_pin_on_cpu: > seccomp_off: 10685244 > seccomp_on_1_filters: 9146483 overhead:14.1% > seccomp_on_4_filters: 8969886 overhead:16.0% > seccomp_on_32_filters: 6454372 overhead:39.6% > > N-filter bpf overhead: > 1_filters: 5.7% > 4_filters: 17.0% > 32_filters: 38.7% > > // kernel code modification place > static noinline u32 bpf_prog_run_pin_on_cpu_allow(const struct bpf_prog *prog, const void *ctx) > { > return SECCOMP_RET_ALLOW; > } This is apples to oranges. 
As explained earlier: https://lore.kernel.org/netdev/20200531171915.wsxvdjeetmhpsdv2@ast-mbp.dhcp.thefacebook.com/T/#u Please use __weak instead of static and redo the numbers. ^ permalink raw reply [flat|nested] 24+ messages in thread
* 答复: new seccomp mode aims to improve performance 2020-06-01 3:30 ` Alexei Starovoitov @ 2020-06-02 2:42 ` zhujianwei (C) 2020-06-02 3:24 ` Alexei Starovoitov 0 siblings, 1 reply; 24+ messages in thread From: zhujianwei (C) @ 2020-06-02 2:42 UTC (permalink / raw) To: Alexei Starovoitov Cc: Kees Cook, bpf, linux-security-module, Hehuazhen, Lennart Poettering, Christian Ehrhardt, Zbigniew Jędrzejewski-Szmek > > This is the test result on linux 5.7.0-rc7 for aarch64. > And retpoline disabled default. > #cat /sys/devices/system/cpu/vulnerabilities/spectre_v2 > Not affected > > bpf_jit_enable 1 > bpf_jit_harden 0 > > We run unixbench syscall benchmark on the original kernel and the new one(replace bpf_prog_run_pin_on_cpu() with immediately returning 'allow' one). > The unixbench syscall testcase runs 5 system calls(close/umask/dup/getpid/getuid, extra 15 syscalls needed to run it) in a loop for 10 seconds, counts the number and finally output it. We also add some more filters (each with the same rules) to evaluate the situation just like kees mentioned(case like systemd-resolve), and we find it is right: more filters, more overhead. The following is our result (./syscall 10 m): > > original: > seccomp_off: 10684939 > seccomp_on_1_filters: 8513805 overhead:19.8% > seccomp_on_4_filters: 7105592 overhead:33.0% > seccomp_on_32_filters: 2308677 overhead:78.3% > > after replacing bpf_prog_run_pin_on_cpu: > seccomp_off: 10685244 > seccomp_on_1_filters: 9146483 overhead:14.1% > seccomp_on_4_filters: 8969886 overhead:16.0% > seccomp_on_32_filters: 6454372 overhead:39.6% > > N-filter bpf overhead: > 1_filters: 5.7% > 4_filters: 17.0% > 32_filters: 38.7% > > // kernel code modification place > static noinline u32 bpf_prog_run_pin_on_cpu_allow(const struct > bpf_prog *prog, const void *ctx) { > return SECCOMP_RET_ALLOW; > } >This is apples to oranges. 
> As explained earlier:
> https://lore.kernel.org/netdev/20200531171915.wsxvdjeetmhpsdv2@ast-mbp.dhcp.thefacebook.com/T/#u
> Please use __weak instead of static and redo the numbers.

We have replaced 'static' with '__weak', tested the same way, and got almost the same result in our test environment (aarch64).

-static noinline u32 bpf_prog_run_pin_on_cpu_allow(const struct bpf_prog *prog, const void *ctx)
+__weak noinline u32 bpf_prog_run_pin_on_cpu_allow(const struct bpf_prog *prog, const void *ctx)

original:
seccomp_off: 10684939
seccomp_on_1_filters: 8513805 overhead: 19.8%
seccomp_on_4_filters: 7105592 overhead: 33.0%
seccomp_on_32_filters: 2308677 overhead: 78.3%

after replacing bpf_prog_run_pin_on_cpu:
seccomp_off: 10667195
seccomp_on_1_filters: 9147454 overhead: 14.2%
seccomp_on_4_filters: 8927605 overhead: 16.1%
seccomp_on_32_filters: 6355476 overhead: 40.6%

N-filter bpf overhead:
1_filters: 5.6%
4_filters: 16.9%
32_filters: 37.7%

^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: 答复: new seccomp mode aims to improve performance 2020-06-02 2:42 ` 答复: " zhujianwei (C) @ 2020-06-02 3:24 ` Alexei Starovoitov 2020-06-02 11:13 ` 答复: " zhujianwei (C) 2020-06-02 11:34 ` zhujianwei (C) 0 siblings, 2 replies; 24+ messages in thread From: Alexei Starovoitov @ 2020-06-02 3:24 UTC (permalink / raw) To: zhujianwei (C) Cc: Kees Cook, bpf, linux-security-module, Hehuazhen, Lennart Poettering, Christian Ehrhardt, Zbigniew Jędrzejewski-Szmek On Tue, Jun 02, 2020 at 02:42:35AM +0000, zhujianwei (C) wrote: > > > > This is the test result on linux 5.7.0-rc7 for aarch64. > > And retpoline disabled default. > > #cat /sys/devices/system/cpu/vulnerabilities/spectre_v2 > > Not affected > > > > bpf_jit_enable 1 > > bpf_jit_harden 0 > > > > We run unixbench syscall benchmark on the original kernel and the new one(replace bpf_prog_run_pin_on_cpu() with immediately returning 'allow' one). > > The unixbench syscall testcase runs 5 system calls(close/umask/dup/getpid/getuid, extra 15 syscalls needed to run it) in a loop for 10 seconds, counts the number and finally output it. We also add some more filters (each with the same rules) to evaluate the situation just like kees mentioned(case like systemd-resolve), and we find it is right: more filters, more overhead. 
The following is our result (./syscall 10 m): > > > > original: > > seccomp_off: 10684939 > > seccomp_on_1_filters: 8513805 overhead:19.8% > > seccomp_on_4_filters: 7105592 overhead:33.0% > > seccomp_on_32_filters: 2308677 overhead:78.3% > > > > after replacing bpf_prog_run_pin_on_cpu: > > seccomp_off: 10685244 > > seccomp_on_1_filters: 9146483 overhead:14.1% > > seccomp_on_4_filters: 8969886 overhead:16.0% > > seccomp_on_32_filters: 6454372 overhead:39.6% > > > > N-filter bpf overhead: > > 1_filters: 5.7% > > 4_filters: 17.0% > > 32_filters: 38.7% > > > > // kernel code modification place > > static noinline u32 bpf_prog_run_pin_on_cpu_allow(const struct > > bpf_prog *prog, const void *ctx) { > > return SECCOMP_RET_ALLOW; > > } > > >This is apples to oranges. > >As explained earlier: > >https://lore.kernel.org/netdev/20200531171915.wsxvdjeetmhpsdv2@ast-mbp.dhcp.thefacebook.com/T/#u > >Please use __weak instead of static and redo the numbers. > > > we have replaced ‘static’ with ‘__weak’, tested with the same way, and got almostly the same result, in our test environment(aarch64). > > -static noinline u32 bpf_prog_run_pin_on_cpu_allow(const struct bpf_prog *prog, const void *ctx) > +__weak noinline u32 bpf_prog_run_pin_on_cpu_allow(const struct bpf_prog *prog, const void *ctx) > > original: > seccomp_off: 10684939 > seccomp_on_1_filters: 8513805 overhead:19.8% > seccomp_on_4_filters: 7105592 overhead:33.0% > seccomp_on_32_filters: 2308677 overhead:78.3% > > after replacing bpf_prog_run_pin_on_cpu: > seccomp_off: 10667195 > seccomp_on_1_filters: 9147454 overhead:14.2% > seccomp_on_4_filters: 8927605 overhead:16.1% > seccomp_on_32_filters: 6355476 overhead:40.6% are you saying that by replacing 'static' with '__weak' it got slower?! Something doesn't add up. Please check generated assembly. By having such 'static noinline bpf_prog_run_pin_on_cpu' you're telling compiler to remove most of seccomp_run_filters() code which now will return only two possible values. 
Which further means that the large 'switch' statement in __seccomp_filter() is also optimized. populate_seccomp_data() is removed. Etc, etc. That explains the 14% vs 19% difference.

Maybe you have some debug option on? Like cant_migrate() not being a nop? Or static_branch not being supported? The sure way is to check the assembly.

^ permalink raw reply [flat|nested] 24+ messages in thread
* 答复: 答复: new seccomp mode aims to improve performance 2020-06-02 3:24 ` Alexei Starovoitov @ 2020-06-02 11:13 ` zhujianwei (C) 2020-06-02 11:34 ` zhujianwei (C) 1 sibling, 0 replies; 24+ messages in thread From: zhujianwei (C) @ 2020-06-02 11:13 UTC (permalink / raw) To: Alexei Starovoitov Cc: Kees Cook, bpf, linux-security-module, Hehuazhen, Lennart Poettering, Christian Ehrhardt, Zbigniew Jędrzejewski-Szmek > > This is the test result on linux 5.7.0-rc7 for aarch64. > > And retpoline disabled default. > > #cat /sys/devices/system/cpu/vulnerabilities/spectre_v2 > > Not affected > > > > bpf_jit_enable 1 > > bpf_jit_harden 0 > > > > We run unixbench syscall benchmark on the original kernel and the new one(replace bpf_prog_run_pin_on_cpu() with immediately returning 'allow' one). > > The unixbench syscall testcase runs 5 system calls(close/umask/dup/getpid/getuid, extra 15 syscalls needed to run it) in a loop for 10 seconds, counts the number and finally output it. We also add some more filters (each with the same rules) to evaluate the situation just like kees mentioned(case like systemd-resolve), and we find it is right: more filters, more overhead. The following is our result (./syscall 10 m): > > > > original: > > seccomp_off: 10684939 > > seccomp_on_1_filters: 8513805 overhead:19.8% > > seccomp_on_4_filters: 7105592 overhead:33.0% > > seccomp_on_32_filters: 2308677 overhead:78.3% > > > > after replacing bpf_prog_run_pin_on_cpu: > > seccomp_off: 10685244 > > seccomp_on_1_filters: 9146483 overhead:14.1% > > seccomp_on_4_filters: 8969886 overhead:16.0% > > seccomp_on_32_filters: 6454372 overhead:39.6% > > > > N-filter bpf overhead: > > 1_filters: 5.7% > > 4_filters: 17.0% > > 32_filters: 38.7% > > > > // kernel code modification place > > static noinline u32 bpf_prog_run_pin_on_cpu_allow(const struct > > bpf_prog *prog, const void *ctx) { > > return SECCOMP_RET_ALLOW; > > } > > >This is apples to oranges. 
> >As explained earlier: > >https://lore.kernel.org/netdev/20200531171915.wsxvdjeetmhpsdv2@ast-mb > >p.dhcp.thefacebook.com/T/#u Please use __weak instead of static and > >redo the numbers. > > > we have replaced ‘static’ with ‘__weak’, tested with the same way, and got almostly the same result, in our test environment(aarch64). > > -static noinline u32 bpf_prog_run_pin_on_cpu_allow(const struct > bpf_prog *prog, const void *ctx) > +__weak noinline u32 bpf_prog_run_pin_on_cpu_allow(const struct > +bpf_prog *prog, const void *ctx) > > original: > seccomp_off: 10684939 > seccomp_on_1_filters: 8513805 overhead:19.8% > seccomp_on_4_filters: 7105592 overhead:33.0% > seccomp_on_32_filters: 2308677 overhead:78.3% > > after replacing bpf_prog_run_pin_on_cpu: > seccomp_off: 10667195 > seccomp_on_1_filters: 9147454 overhead:14.2% > seccomp_on_4_filters: 8927605 overhead:16.1% > seccomp_on_32_filters: 6355476 overhead:40.6% > are you saying that by replacing 'static' with '__weak' it got slower?! > Something doesn't add up. Please check generated assembly. > By having such 'static noinline bpf_prog_run_pin_on_cpu' you're telling compiler to remove most of seccomp_run_filters() code which now will return only two possible values. Which further means that large 'switch' > statement in __seccomp_filter() is also optimized. populate_seccomp_data() is removed. Etc, etc. That explains 14% vs 19% difference. > May be you have some debug on? Like cant_migrate() is not a nop? > Or static_branch is not supported? > The sure way is to check assembly. No, we say that by replacing 'static' with '__weak' it got the same result, in our testcase which filters 20 allowed syscall num (for details, see the previous post). static case: N-filter bpf overhead: 1_filters: 5.7% 4_filters: 17.0% 32_filters: 38.7% __weak case: N-filter bpf overhead: 1_filters: 5.6% 4_filters: 16.9% 32_filters: 37.7% ^ permalink raw reply [flat|nested] 24+ messages in thread
* 答复: 答复: new seccomp mode aims to improve performance 2020-06-02 3:24 ` Alexei Starovoitov 2020-06-02 11:13 ` 答复: " zhujianwei (C) @ 2020-06-02 11:34 ` zhujianwei (C) 2020-06-02 18:32 ` Kees Cook 1 sibling, 1 reply; 24+ messages in thread From: zhujianwei (C) @ 2020-06-02 11:34 UTC (permalink / raw) To: Alexei Starovoitov, Kees Cook Cc: bpf, linux-security-module, Hehuazhen, Lennart Poettering, Christian Ehrhardt, Zbigniew Jędrzejewski-Szmek > > > This is the test result on linux 5.7.0-rc7 for aarch64. > > > And retpoline disabled default. > > > #cat /sys/devices/system/cpu/vulnerabilities/spectre_v2 > > > Not affected > > > > > > bpf_jit_enable 1 > > > bpf_jit_harden 0 > > > > > > We run unixbench syscall benchmark on the original kernel and the new one(replace bpf_prog_run_pin_on_cpu() with immediately returning 'allow' one). > > > The unixbench syscall testcase runs 5 system calls(close/umask/dup/getpid/getuid, extra 15 syscalls needed to run it) in a loop for 10 seconds, counts the number and finally output it. We also add some more filters (each with the same rules) to evaluate the situation just like kees mentioned(case like systemd-resolve), and we find it is right: more filters, more overhead. 
The following is our result (./syscall 10 m): > > > > > > original: > > > seccomp_off: 10684939 > > > seccomp_on_1_filters: 8513805 overhead:19.8% > > > seccomp_on_4_filters: 7105592 overhead:33.0% > > > seccomp_on_32_filters: 2308677 overhead:78.3% > > > > > > after replacing bpf_prog_run_pin_on_cpu: > > > seccomp_off: 10685244 > > > seccomp_on_1_filters: 9146483 overhead:14.1% > > > seccomp_on_4_filters: 8969886 overhead:16.0% > > > seccomp_on_32_filters: 6454372 overhead:39.6% > > > > > > N-filter bpf overhead: > > > 1_filters: 5.7% > > > 4_filters: 17.0% > > > 32_filters: 38.7% > > > > > > // kernel code modification place > > > static noinline u32 bpf_prog_run_pin_on_cpu_allow(const struct > > > bpf_prog *prog, const void *ctx) { > > > return SECCOMP_RET_ALLOW; > > > } > > > > >This is apples to oranges. > > >As explained earlier: > > >https://lore.kernel.org/netdev/20200531171915.wsxvdjeetmhpsdv2@ast-mb > > >p.dhcp.thefacebook.com/T/#u Please use __weak instead of static and > > >redo the numbers. > > > > > > we have replaced ‘static’ with ‘__weak’, tested with the same way, and got almostly the same result, in our test environment(aarch64). > > > > -static noinline u32 bpf_prog_run_pin_on_cpu_allow(const struct > > bpf_prog *prog, const void *ctx) > > +__weak noinline u32 bpf_prog_run_pin_on_cpu_allow(const struct > > +bpf_prog *prog, const void *ctx) > > > > original: > > seccomp_off: 10684939 > > seccomp_on_1_filters: 8513805 overhead:19.8% > > seccomp_on_4_filters: 7105592 overhead:33.0% > > seccomp_on_32_filters: 2308677 overhead:78.3% > > > > after replacing bpf_prog_run_pin_on_cpu: > > seccomp_off: 10667195 > > seccomp_on_1_filters: 9147454 overhead:14.2% > > seccomp_on_4_filters: 8927605 overhead:16.1% > > seccomp_on_32_filters: 6355476 overhead:40.6% > > are you saying that by replacing 'static' with '__weak' it got slower?! > > Something doesn't add up. Please check generated assembly. 
> > By having such 'static noinline bpf_prog_run_pin_on_cpu' you're telling
> > compiler to remove most of seccomp_run_filters() code which now will
> > return only two possible values. Which further means that large 'switch'
> > statement in __seccomp_filter() is also optimized.
> > populate_seccomp_data() is removed. Etc, etc. That explains 14% vs 19%
> > difference.
> > May be you have some debug on?
> > Or static_branch is not supported?
> > The sure way is to check assembly.

> No, we say that by replacing 'static' with '__weak' it got the same result,
> in our testcase which filters 20 allowed syscall num (for details, see the
> previous post).
>
> static case:
> N-filter bpf overhead:
> 1_filters: 5.7%
> 4_filters: 17.0%
> 32_filters: 38.7%
>
> __weak case:
> N-filter bpf overhead:
> 1_filters: 5.6%
> 4_filters: 16.9%
> 32_filters: 37.7%

And in many scenarios the requirements for syscall filtering are simple and do not need complex filter rules: for example, just configuring a syscall black or white list. However, we have noticed that seccomp has a performance overhead that cannot be ignored even in this simple scenario. For example, referring to Kees's test data, this cost is almost 41/636 = 6.5%, and Alexei's data gives 17/226 = 7.5%, based on a single rule filtering (getpid); our data for this overhead is 19.8% (refer to the previous 'original' test results), filtering with our 20 rules (unixbench syscall).

To improve filtering performance in these scenarios, as a PoC we replaced the call to secure_computing() in syscall_trace_enter() with a light_syscall_filter() which uses a bitmap for syscall filtering, and which only handles the syscall-number-only case, not the syscall-args case. The light_syscall_filter() would be implemented as a sub-branch of secure_computing(). To measure the performance, we use the same unixbench syscall testcase with 20 allowed-syscall-number rules. The result shows that light_syscall_filter() imposed only 1.2% overhead.
Can we take a deeper discussion on adding this light filter mode?

These are the details:

light_syscall_filter: // run unixbench syscall testcase 10 times and take the average
seccomp_off: 10684018
seccomp_on_1_filters: 10553233 overhead: 1.2%

// kernel modification
--- linux-5.7-rc7_1/arch/arm64/kernel/ptrace.c	2020-05-25 06:32:54.000000000 +0800
+++ linux-5.7-rc7/arch/arm64/kernel/ptrace.c	2020-06-02 12:35:04.412000000 +0800
@@ -1827,6 +1827,46 @@
 	regs->regs[regno] = saved_reg;
 }
 
+#define PID_MAX 1000000
+#define SYSNUM_MAX 0x220
+
+/* all zero */
+bool g_light_filter_switch[PID_MAX] = {0};
+bool g_light_filter_bitmap[PID_MAX][SYSNUM_MAX] = {0};
+
+static int __light_syscall_filter(void) {
+	int pid;
+	int this_syscall;
+
+	pid = current->pid;
+	this_syscall = syscall_get_nr(current, task_pt_regs(current));
+
+	if (g_light_filter_bitmap[pid][this_syscall] == true) {
+		printk(KERN_ERR "light syscall filter: syscall num %d denied.\n", this_syscall);
+		goto skip;
+	}
+
+	return 0;
+skip:
+	return -1;
+}
+
+static inline int light_syscall_filter(void) {
+	if (unlikely(test_thread_flag(TIF_SECCOMP))) {
+		return __light_syscall_filter();
+	}
+
+	return 0;
+}
+
 int syscall_trace_enter(struct pt_regs *regs)
 {
 	unsigned long flags = READ_ONCE(current_thread_info()->flags);
@@ -1837,9 +1877,10 @@
 		return -1;
 	}
 
-	/* Do the secure computing after ptrace; failures should be fast. */
-	if (secure_computing() == -1)
+	/* light check for syscall-num-only rule. */
+	if (light_syscall_filter() == -1) {
 		return -1;
+	}
 
 	if (test_thread_flag(TIF_SYSCALL_TRACEPOINT))
 		trace_sys_enter(regs, regs->syscallno);

^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: 答复: 答复: new seccomp mode aims to improve performance 2020-06-02 11:34 ` zhujianwei (C) @ 2020-06-02 18:32 ` Kees Cook 2020-06-03 4:51 ` 答复: " zhujianwei (C) 0 siblings, 1 reply; 24+ messages in thread From: Kees Cook @ 2020-06-02 18:32 UTC (permalink / raw) To: zhujianwei (C) Cc: Alexei Starovoitov, bpf, linux-security-module, Hehuazhen, Lennart Poettering, Christian Ehrhardt, Zbigniew Jędrzejewski-Szmek On Tue, Jun 02, 2020 at 11:34:04AM +0000, zhujianwei (C) wrote: > And in many scenarios, the requirement for syscall filter is usually > simple, and does not need complex filter rules, for example, just > configure a syscall black or white list. However, we have noticed that > seccomp will have a performance overhead that cannot be ignored in this > simple scenario. For example, referring to Kees's t est data, this cost > is almost 41/636 = 6.5%, and Alex's data is 17/226 = 7.5%, based on > single rule of filtering (getpid); Our data for this overhead is 19.8% > (refer to the previous 'orignal' test results), filtering based on our > 20 rules (unixbench syscall). I wonder if aarch64 has higher overhead for calling into the TIF_WORK trace stuff? (Or if aarch64's BPF JIT is not as efficient as x86?) > // kernel modification > --- linux-5.7-rc7_1/arch/arm64/kernel/ptrace.c 2020-05-25 06:32:54.000000000 +0800 > +++ linux-5.7-rc7/arch/arm64/kernel/ptrace.c 2020-06-02 12:35:04.412000000 +0800 > @@ -1827,6 +1827,46 @@ > regs->regs[regno] = saved_reg; > } > > +#define PID_MAX 1000000 > +#define SYSNUM_MAX 0x220 You can use NR_syscalls here, I think. > + > +/* all zero*/ > +bool g_light_filter_switch[PID_MAX] = {0}; > +bool g_light_filter_bitmap[PID_MAX][SYSNUM_MAX] = {0}; These can be static, and I would double-check your allocation size -- I suspect this is allocating a byte for each bool. I would recommend DECLARE_BITMAP() and friends. 
> +static int __light_syscall_filter(void) { > + int pid; > + int this_syscall; > + > + pid = current->pid; > + this_syscall = syscall_get_nr(current, task_pt_regs(current)); > + > + if(g_light_filter_bitmap[pid][this_syscall] == true) { > + printk(KERN_ERR "light syscall filter: syscall num %d denied.\n", this_syscall); > + goto skip; > + } > + > + return 0; > +skip: > + return -1; > +} > + > +static inline int light_syscall_filter(void) { > + if (unlikely(test_thread_flag(TIF_SECCOMP))) { > + return __light_syscall_filter(); > + } > + > + return 0; > +} > + > int syscall_trace_enter(struct pt_regs *regs) > { > unsigned long flags = READ_ONCE(current_thread_info()->flags); > @@ -1837,9 +1877,10 @@ > return -1; > } > > - /* Do the secure computing after ptrace; failures should be fast. */ > - if (secure_computing() == -1) > + /* light check for syscall-num-only rule. */ > + if (light_syscall_filter() == -1) { > return -1; > + } > > if (test_thread_flag(TIF_SYSCALL_TRACEPOINT)) > trace_sys_enter(regs, regs->syscallno); Given that you're still doing this in syscall_trace_enter(), I imagine it could live in secure_computing(). Anyway, the functionality here is similar to what I've been working on for bitmaps (having a global preallocated bitmap isn't going to be upstreamable, but it's good for PoC). The complications are with handling differing architecture (for compat systems), tracking/choosing between the various basic SECCOMP_RET_* behaviors, etc. -Kees -- Kees Cook ^ permalink raw reply [flat|nested] 24+ messages in thread
* 答复: 答复: 答复: new seccomp mode aims to improve performance
  2020-06-02 18:32 ` Kees Cook
@ 2020-06-03  4:51 ` zhujianwei (C)
  0 siblings, 0 replies; 24+ messages in thread
From: zhujianwei (C) @ 2020-06-03 4:51 UTC (permalink / raw)
To: Kees Cook
Cc: Alexei Starovoitov, bpf, linux-security-module, Hehuazhen,
    Lennart Poettering, Christian Ehrhardt, Zbigniew Jędrzejewski-Szmek

> Given that you're still doing this in syscall_trace_enter(), I imagine
> it could live in secure_computing().

Indeed, we agree that light_syscall_filter() could be added to secure_computing().

> I wonder if aarch64 has higher overhead for calling into the TIF_WORK
> trace stuff? (Or if aarch64's BPF JIT is not as efficient as x86?)

We also suspect there are several possible reasons. We think that the further forward the bitmap filter is placed, the better; our test results show that placing the bitmap filter early removes most of the single-filter total overhead. What is your opinion on that?

> Anyway, the functionality here is similar to what I've been working
> on for bitmaps (having a global preallocated bitmap isn't going to be
> upstreamable, but it's good for PoC). The complications are with handling
> differing architecture (for compat systems), tracking/choosing between
> the various basic SECCOMP_RET_* behaviors, etc.

Firstly, thank you for the corrections to the code; yes, it is just a PoC for performance testing. Indeed, your bitmap idea is basically the same as ours, and we are trying to find a solution that improves seccomp performance for production use. So what is your plan for getting it done? Could we do something to help you proceed with it?

^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: new seccomp mode aims to improve performance
  2020-05-29 19:27 ` Kees Cook
  2020-05-31 17:19 ` Alexei Starovoitov
  2020-06-01  2:08 ` 答复: " zhujianwei (C)
@ 2020-06-01 10:11 ` Lennart Poettering
  2020-06-01 12:32 ` Paul Moore
  2020-06-01 18:21 ` Kees Cook
  2 siblings, 2 replies; 24+ messages in thread
From: Lennart Poettering @ 2020-06-01 10:11 UTC (permalink / raw)
To: Kees Cook
Cc: Alexei Starovoitov, zhujianwei (C), bpf, linux-security-module,
    Hehuazhen, Christian Ehrhardt, Zbigniew Jędrzejewski-Szmek

On Fr, 29.05.20 12:27, Kees Cook (keescook@chromium.org) wrote:

> # grep ^Seccomp_filters /proc/$(pidof systemd-resolved)/status
> Seccomp_filters: 32
>
> # grep SystemCall /lib/systemd/system/systemd-resolved.service
> SystemCallArchitectures=native
> SystemCallErrorNumber=EPERM
> SystemCallFilter=@system-service
>
> I'd like to better understand what they're doing, but haven't had time
> to dig in. (The systemd devel mailing list requires subscription, so
> I've directly CCed some systemd folks that have touched seccomp there
> recently. Hi! The start of this thread is here[4].)

Hmm, so on x86-64 we try to install our seccomp filters three times: for the x86-64 syscall ABI, for the i386 syscall ABI and for the x32 syscall ABI. Not all of the filters we apply work on all ABIs though, because syscalls are available on some but not others, or cannot sensibly be matched on some (because of socketcall, ipc and such multiplexed syscalls).

When we first added support for seccomp filters to systemd we compiled everything into a single filter, and let libseccomp apply it to different archs. But that didn't work out, since libseccomp doesn't tell us when it manages to apply a filter and when not, i.e. to which arch it worked and to which arch it didn't. And since we have some whitelist and some blacklist filters, the internal fallback logic of libseccomp doesn't work for us either, since you never know what you end up with.
So we ended up breaking the different settings up into individual filters, applying them individually and separately for each arch, so that we know exactly what we managed to install and what not, what we can then rely on to properly filter, and what we can check in our test suite.

Keeping the filters separate made things a lot easier and simpler to debug, and our log output and testing became much less of a black box. We know exactly what worked and what didn't, and our tests validate each filter.

For systemd-resolved we apply a bunch more filters than just those that are the result of SystemCallFilter= and SystemCallArchitectures= (SystemCallFilter= itself synthesizes one filter per syscall ABI):

1. RestrictSUIDSGID= generates a seccomp filter blocking the creation of suid/sgid binaries, i.e. filters chmod() and related calls and their arguments
2. LockPersonality= blocks personality() for most arguments
3. MemoryDenyWriteExecute= blocks mmap() and similar calls if the selected map has X and W set at the same time
4. RestrictRealtime= blocks sched_setscheduler() for most parameters
5. RestrictAddressFamilies= blocks socket() and related calls for various address families
6. ProtectKernelLogs= blocks the syslog() syscall for most parameters
7. ProtectKernelTunables= blocks the old _sysctl() syscall among some other things
8. RestrictNamespaces= blocks various unshare() and clone() bits

So yeah, if one turns on many of these options in services (and we generally turn on everything we can for the services we ship) and then multiplies that by the archs, you end up with quite a bunch.

If we wanted to optimize that in userspace, then libseccomp would have to be improved quite substantially to let us know exactly what works and what doesn't, and to have sane fallbacks both when building whitelists and blacklists.

An easy improvement is probably if libseccomp would now start refusing to install x32 seccomp filters altogether, now that x32 is entirely dead?
Or are the entrypoints for x32 syscalls still available in the kernel? How could userspace figure out if they are available? If libseccomp doesn't want to add code for that, we probably could have that in systemd itself too... Lennart -- Lennart Poettering, Berlin ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: new seccomp mode aims to improve performance 2020-06-01 10:11 ` Lennart Poettering @ 2020-06-01 12:32 ` Paul Moore 2020-06-02 12:53 ` Lennart Poettering 2020-06-01 18:21 ` Kees Cook 1 sibling, 1 reply; 24+ messages in thread From: Paul Moore @ 2020-06-01 12:32 UTC (permalink / raw) To: Lennart Poettering Cc: Kees Cook, Alexei Starovoitov, zhujianwei (C), bpf, linux-security-module, Hehuazhen, Christian Ehrhardt, Zbigniew Jędrzejewski-Szmek, Tom Hromatka On Mon, Jun 1, 2020 at 6:17 AM Lennart Poettering <lennart@poettering.net> wrote: > On Fr, 29.05.20 12:27, Kees Cook (keescook@chromium.org) wrote: > > # grep ^Seccomp_filters /proc/$(pidof systemd-resolved)/status > > Seccomp_filters: 32 > > > > # grep SystemCall /lib/systemd/system/systemd-resolved.service > > SystemCallArchitectures=native > > SystemCallErrorNumber=EPERM > > SystemCallFilter=@system-service > > > > I'd like to better understand what they're doing, but haven't had time > > to dig in. (The systemd devel mailing list requires subscription, so > > I've directly CCed some systemd folks that have touched seccomp there > > recently. Hi! The starts of this thread is here[4].) > > Hmm, so on x86-64 we try to install our seccomp filters three times: > for the x86-64 syscall ABI, for the i386 syscall ABI and for the x32 > syscall ABI. Not all of the filters we apply work on all ABIs though, > because syscalls are available on some but not others, or cannot > sensibly be matched on some (because of socketcall, ipc and such > multiplexed syscalls). > > When we fist added support for seccomp filters to systemd we compiled > everything into a single filter, and let libseccomp apply it to > different archs. But that didn't work out, since libseccomp doesn't > tell use when it manages to apply a filter and when not, i.e. to which > arch it worked and to which arch it didn't. 
And since we have some > whitelist and some blacklist filters the internal fallback logic of > libsecccomp doesn't work for us either, since you never know what you > end up with. So we ended up breaking the different settings up into > individual filters, and apply them individually and separately for > each arch, so that we know exactly what we managed to install and what > not, and what we can then know will properly filter and can check in > our test suite. > > Keeping the filters separate made things a lot easier and simpler to > debug, and our log output and testing became much less of a black > box. We know exactly what worked and what didn't, and our test > validate each filter. In situations where the calling application creates multiple per-ABI filters, the seccomp_merge(3) function can be used to merge the filters into one. There are some limitations (same byte ordering, filter attributes, etc.) but in general it should work without problem when merging x86_64, x32, and x86. For what it is worth, libseccomp does handle things like the multiplexed socket syscalls[*] across multiple ABIs, just not quite in the way Lennart and systemd wanted. It is also possible, although I would be a bit surprised, that some of the systemd's concerns have been resolved in modern libseccomp. For better or worse, systemd was one of the first adopters of libseccomp and they had to deal with more than a few bumps as the library was developed. [*] Handling the multiplexed syscalls is tricky, especially when one combines multiple ABIs and the presence of both the multiplexed and direct-wired syscalls on some kernel versions. Recent libseccomp versions do handle all these cases; creating multiplexed filters, direct-wired filters, or both depending on the particular ABI. The problem comes when you try to wrap all of that up in a single library API that works regardless of the ABI and kernel version across different build and runtime environments. 
This is why we don't support the "exact" variants of the libseccomp API on filters which contain multiple ABIs; we simply can't guarantee that we will always be able to filter on the third argument of socket() in a filter that consists of the x86_64 and x86 ABIs. The non-exact API variants create the rules as best they can in this case, creating three rules in the filter: an x86_64 rule which filters on the third argument of socket(), an x86 rule which filters on the third argument of the direct-wired socket(), and an x86 rule which filters on the multiplexed socketcall(socket) syscall (impossible to filter on the syscall argument here). > For systemd-resolved we apply a bunch more filters than just those > that are result of SystemCallFilter= and SystemCallArchitectures= > (SystemCallFilter= itself synthesizes one filter per syscall ABI). ... > So yeah, if one turns on many of these options in services (and we > generally turn on everything we can for the services we ship) and then > multiply that by the archs you end up with quite a bunch. I'm not sure how systemd is architected with respect to seccomp filtering, but once again it would seem like seccomp_merge() could be useful here. > If we wanted to optimize that in userspace, then libseccomp would have > to be improved quite substantially to let us know exactly what works > and what doesn't, and to have sane fallback both when building > whitelists and blacklists. It has been quite a while since we last talked about systemd's use of libseccomp, but the upcoming v2.5.0 release (no date set yet, but think weeks not months) finally takes a first step towards defining proper return values on error for the API, no more "negative values on error". I'm sure there are other things, but I recall this as being one of the bigger systemd wants. 
As an aside, it is always going to be difficult to allow fine-grained control when you have a single libseccomp filter that includes multiple ABIs; the different ABI oddities are just too great (see comments above). If you need exacting control of the filter, or ABI-specific handling, then the recommended way is to create those filters independently and merge them together before loading them into the kernel or applying any common rules. > An easy improvement is probably if libseccomp would now start refusing > to install x32 seccomp filters altogether now that x32 is entirely > dead? Or are the entrypoints for x32 syscalls still available in the > kernel? How could userspace figure out if they are available? If > libseccomp doesn't want to add code for that, we probably could have > that in systemd itself too... You can eliminate x32 syscalls today using libseccomp through either the "BADARCH" filter attribute or through an x32-specific filter that defaults to KILL/ERRNO/etc. and has no rules (of course you could merge this x32 filter with your x86_64 filter). While I don't see us removing the ability to create x32 filters from libseccomp any time soon (need to support older kernels), I can say that I would be very happy to see x32 removed from systems. Regardless of what one may think of the wisdom in creating this ABI, I think we can agree the implementation was a bit of a hack. -- paul moore www.paul-moore.com ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: new seccomp mode aims to improve performance 2020-06-01 12:32 ` Paul Moore @ 2020-06-02 12:53 ` Lennart Poettering 2020-06-02 15:03 ` Paul Moore 0 siblings, 1 reply; 24+ messages in thread From: Lennart Poettering @ 2020-06-02 12:53 UTC (permalink / raw) To: Paul Moore Cc: Kees Cook, Alexei Starovoitov, zhujianwei (C), bpf, linux-security-module, Hehuazhen, Christian Ehrhardt, Zbigniew Jędrzejewski-Szmek, Tom Hromatka On Mo, 01.06.20 08:32, Paul Moore (paul@paul-moore.com) wrote: > In situations where the calling application creates multiple per-ABI > filters, the seccomp_merge(3) function can be used to merge the > filters into one. There are some limitations (same byte ordering, > filter attributes, etc.) but in general it should work without problem > when merging x86_64, x32, and x86. Hmm, so we currently only use seccomp_rule_add_exact() to build an individual filter and finally seccomp_load() to install it. Which tells us exactly what works and what does not. If we now would use seccomp_rule_add_exact() to build the filters, but then use seccomp_merge() to merge them all, and then only do a single seccomp_load(), will this give us the same info? i.e. will seccomp_merge() give us the same errors seccomp_load() currently gives us when something cannot work? > > If we wanted to optimize that in userspace, then libseccomp would have > > to be improved quite substantially to let us know exactly what works > > and what doesn't, and to have sane fallback both when building > > whitelists and blacklists. > > It has been quite a while since we last talked about systemd's use of > libseccomp, but the upcoming v2.5.0 release (no date set yet, but > think weeks not months) finally takes a first step towards defining > proper return values on error for the API, no more "negative values on > error". I'm sure there are other things, but I recall this as being > one of the bigger systemd wants. Yes, we care about error codes a lot. 
> As an aside, it is always going to be difficult to allow fine grained > control when you have a single libseccomp filter that includes > multiple ABIs; the different ABI oddities are just too great (see > comments above). If you need exacting control of the filter, or ABI > specific handling, then the recommended way is to create those filters > independently and merge them together before loading them into the > kernel or applying any common rules. Hmm, so not sure I got this. But are you saying that when using seccomp_merge() am I supposed to merge filters for different archs into one megafilter, or are you saying the opposite: am I supposed not to do that? I.e. in an ideal world, should we come to a situation where per service on x86-64 we will have exactly one filter installed, or should we come to a situation where we'll have exactly three installed, one for each ABI? Lennart -- Lennart Poettering, Berlin ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: new seccomp mode aims to improve performance 2020-06-02 12:53 ` Lennart Poettering @ 2020-06-02 15:03 ` Paul Moore 2020-06-02 18:39 ` Kees Cook 0 siblings, 1 reply; 24+ messages in thread From: Paul Moore @ 2020-06-02 15:03 UTC (permalink / raw) To: Lennart Poettering Cc: Kees Cook, Alexei Starovoitov, zhujianwei (C), bpf, linux-security-module, Hehuazhen, Christian Ehrhardt, Zbigniew Jędrzejewski-Szmek, Tom Hromatka On Tue, Jun 2, 2020 at 8:53 AM Lennart Poettering <lennart@poettering.net> wrote: > On Mo, 01.06.20 08:32, Paul Moore (paul@paul-moore.com) wrote: > > > In situations where the calling application creates multiple per-ABI > > filters, the seccomp_merge(3) function can be used to merge the > > filters into one. There are some limitations (same byte ordering, > > filter attributes, etc.) but in general it should work without problem > > when merging x86_64, x32, and x86. > > Hmm, so we currently only use seccomp_rule_add_exact() to build an > individual filter and finally seccomp_load() to install it. Which > tells us exactly what works and what does not. > > If we now would use seccomp_rule_add_exact() to build the filters, but > then use seccomp_merge() to merge them all, and then only do a single > seccomp_load(), will this give us the same info? i.e. will > seccomp_merge() give us the same errors seccomp_load() currently gives > us when something cannot work? Yes, it should. The seccomp_merge() API was created for applications, like systemd, that create multiple per-ABI filters but still want to consolidate the multiple filters into one, either to add common rules or reduce the number of filters loaded into the kernel. Since the motivation behind the seccomp_merge() API was to help make it easier to create complex filters for multiple ABIs, it is limited to merging filters that do not share any of the same ABIs. 
For example, you could merge a filter that contains an x86 ABI with another filter that contains x86_64 and x32 ABIs, but you can *not* merge two filters that both contain x86 ABIs. This greatly simplifies the merge operation and makes it fairly quick. If you try it and run into any problems let Tom (CC'd) or me know and we will try to help you guys out. > > > If we wanted to optimize that in userspace, then libseccomp would have > > > to be improved quite substantially to let us know exactly what works > > > and what doesn't, and to have sane fallback both when building > > > whitelists and blacklists. > > > > It has been quite a while since we last talked about systemd's use of > > libseccomp, but the upcoming v2.5.0 release (no date set yet, but > > think weeks not months) finally takes a first step towards defining > > proper return values on error for the API, no more "negative values on > > error". I'm sure there are other things, but I recall this as being > > one of the bigger systemd wants. > > Yes, we care about error codes a lot. The lack of any real stability in error codes was my fault; we really should have done it sooner, but I felt there were bigger issues to focus on at the time. Regardless, starting with v2.5.0 we will have stable error codes that are documented in the manpages; the number of errors, and the associated documentation, may be relatively limited to start with, but it should improve over time as we get feedback. > > As an aside, it is always going to be difficult to allow fine grained > > control when you have a single libseccomp filter that includes > > multiple ABIs; the different ABI oddities are just too great (see > > comments above). If you need exacting control of the filter, or ABI > > specific handling, then the recommended way is to create those filters > > independently and merge them together before loading them into the > > kernel or applying any common rules. > > Hmm, so not sure I got this. 
> But are you saying that when using > seccomp_merge() am I supposed to merge filters for different archs > into one megafilter, or are you saying the opposite: am I supposed not > to do that? Okay, let me try this again ... In the context of this discussion it looks like limiting the number of seccomp filters loaded into the kernel is desirable and one way to do that with libseccomp is to use the seccomp_merge() API to consolidate multiple per-ABI filters into one "megafilter" (I like that term) which can be loaded into the kernel. Does that make more sense? > I.e. in an ideal world, should we come to a situation where per > service on x86-64 we will have exactly one filter installed, or should > we come to a situation where we'll have exactly three installed, once > for each ABI? Perhaps others will clarify, but from my reading of this thread there is a performance advantage to be gained by limiting the number of seccomp filters installed for a given process. -- paul moore www.paul-moore.com ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: new seccomp mode aims to improve performance 2020-06-02 15:03 ` Paul Moore @ 2020-06-02 18:39 ` Kees Cook 0 siblings, 0 replies; 24+ messages in thread From: Kees Cook @ 2020-06-02 18:39 UTC (permalink / raw) To: Paul Moore Cc: Lennart Poettering, Alexei Starovoitov, zhujianwei (C), bpf, linux-security-module, Hehuazhen, Christian Ehrhardt, Zbigniew Jędrzejewski-Szmek, Tom Hromatka On Tue, Jun 02, 2020 at 11:03:31AM -0400, Paul Moore wrote: > Perhaps others will clarify, but from my reading of this thread there > is a performance advantage to be gained by limiting the number of > seccomp filters installed for a given process. Generally speaking, yes, though obviously the size and layout of a single filter (i.e. is it a balanced tree?) will still impact the overhead. -- Kees Cook ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: new seccomp mode aims to improve performance 2020-06-01 10:11 ` Lennart Poettering 2020-06-01 12:32 ` Paul Moore @ 2020-06-01 18:21 ` Kees Cook 2020-06-02 12:44 ` Lennart Poettering 1 sibling, 1 reply; 24+ messages in thread From: Kees Cook @ 2020-06-01 18:21 UTC (permalink / raw) To: Lennart Poettering Cc: Alexei Starovoitov, zhujianwei (C), bpf, linux-security-module, Hehuazhen, Christian Ehrhardt, Zbigniew Jędrzejewski-Szmek On Mon, Jun 01, 2020 at 12:11:37PM +0200, Lennart Poettering wrote: > On Fr, 29.05.20 12:27, Kees Cook (keescook@chromium.org) wrote: > > > # grep ^Seccomp_filters /proc/$(pidof systemd-resolved)/status > > Seccomp_filters: 32 > > > > # grep SystemCall /lib/systemd/system/systemd-resolved.service > > SystemCallArchitectures=native > > SystemCallErrorNumber=EPERM > > SystemCallFilter=@system-service > > > > I'd like to better understand what they're doing, but haven't had time > > to dig in. (The systemd devel mailing list requires subscription, so > > I've directly CCed some systemd folks that have touched seccomp there > > recently. Hi! The starts of this thread is here[4].) > > Hmm, so on x86-64 we try to install our seccomp filters three times: > for the x86-64 syscall ABI, for the i386 syscall ABI and for the x32 > syscall ABI. Not all of the filters we apply work on all ABIs though, > because syscalls are available on some but not others, or cannot > sensibly be matched on some (because of socketcall, ipc and such > multiplexed syscalls). > > [...] Thanks for the details on this! That helps me understand what's happening much better. :) > An easy improvement is probably if libseccomp would now start refusing > to install x32 seccomp filters altogether now that x32 is entirely > dead? Or are the entrypoints for x32 syscalls still available in the > kernel? How could userspace figure out if they are available? If > libseccomp doesn't want to add code for that, we probably could have > that in systemd itself too... 
Would it make sense to provide a systemd setting for services to declare "no compat" or "no x32" (I'm not sure what to call this mode more generically, "no 32-bit allocation ABI"?) Then you can just install a single merged filter for all the native syscalls that starts with "if not native, reject"? (Or better yet: make the default for filtering be "native only", and let services opt into other ABIs?) -- Kees Cook ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: new seccomp mode aims to improve performance 2020-06-01 18:21 ` Kees Cook @ 2020-06-02 12:44 ` Lennart Poettering 2020-06-02 18:37 ` Kees Cook 2020-06-16 6:00 ` Kees Cook 0 siblings, 2 replies; 24+ messages in thread From: Lennart Poettering @ 2020-06-02 12:44 UTC (permalink / raw) To: Kees Cook Cc: Alexei Starovoitov, zhujianwei (C), bpf, linux-security-module, Hehuazhen, Christian Ehrhardt, Zbigniew Jędrzejewski-Szmek On Mo, 01.06.20 11:21, Kees Cook (keescook@chromium.org) wrote: > > > # grep SystemCall /lib/systemd/system/systemd-resolved.service > > > SystemCallArchitectures=native > > > SystemCallErrorNumber=EPERM > > > SystemCallFilter=@system-service > > > > > > I'd like to better understand what they're doing, but haven't had time > > > to dig in. (The systemd devel mailing list requires subscription, so > > > I've directly CCed some systemd folks that have touched seccomp there > > > recently. Hi! The starts of this thread is here[4].) > > > > Hmm, so on x86-64 we try to install our seccomp filters three times: > > for the x86-64 syscall ABI, for the i386 syscall ABI and for the x32 > > syscall ABI. Not all of the filters we apply work on all ABIs though, > > because syscalls are available on some but not others, or cannot > > sensibly be matched on some (because of socketcall, ipc and such > > multiplexed syscalls). > > > > [...] > > Thanks for the details on this! That helps me understand what's > happening much better. :) > > > An easy improvement is probably if libseccomp would now start refusing > > to install x32 seccomp filters altogether now that x32 is entirely > > dead? Or are the entrypoints for x32 syscalls still available in the > > kernel? How could userspace figure out if they are available? If > > libseccomp doesn't want to add code for that, we probably could have > > that in systemd itself too... 
> > Would it make sense to provide a systemd setting for services to declare > "no compat" or "no x32" (I'm not sure what to call this mode more > generically, "no 32-bit allocation ABI"?) Then you can just install > a single merged filter for all the native syscalls that starts with > "if not native, reject"? We have that actually, it's this line you pasted above: SystemCallArchitectures=native It means: block all syscall ABIs but the native one for all processes of this service. We currently use that setting only to synthesize an explicit seccomp filter masking the other ABIs wholesale. We do not use it to suppress generation of other, unrelated seccomp filters for that arch, i.e. you might end up with one filter blocking x32 wholesale, but then another unrelated option might install a filter blocking some specific syscall with some specific arguments, but still gets installed for x86-64 *and* i386 *and* x32. I guess we could relatively easily tweak that and suppress the latter. If we did, then on all services that set SystemCallArchitectures=native on x86-64 the number of installed seccomp filters should become a third. > (Or better yet: make the default for filtering be "native only", and > let services opt into other ABIs?) That sounds like it would make people quite unhappy, no? Given that on a systemd system anything that runs in userspace is ultimately part of a service managed by systemd, if we'd default to "no non-native ABIs" this would translate to "yeah, we entirely disable the i386 ABI for the entire system unless you reconfigure it and/or opt out your old i386 services". Hence, on x86-64, I figure just masking i386 entirely is a bit too drastic a compat breakage for us, no? Masking x32 otoh sounds like a safe default to do without breaking too much compat given that x32 is on its way out. Lennart -- Lennart Poettering, Berlin ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: new seccomp mode aims to improve performance 2020-06-02 12:44 ` Lennart Poettering @ 2020-06-02 18:37 ` Kees Cook 2020-06-16 6:00 ` Kees Cook 1 sibling, 0 replies; 24+ messages in thread From: Kees Cook @ 2020-06-02 18:37 UTC (permalink / raw) To: Lennart Poettering Cc: Alexei Starovoitov, zhujianwei (C), bpf, linux-security-module, Hehuazhen, Christian Ehrhardt, Zbigniew Jędrzejewski-Szmek On Tue, Jun 02, 2020 at 02:44:31PM +0200, Lennart Poettering wrote: > On Mo, 01.06.20 11:21, Kees Cook (keescook@chromium.org) wrote: > > Would it make sense to provide a systemd setting for services to declare > > "no compat" or "no x32" (I'm not sure what to call this mode more > > generically, "no 32-bit allocation ABI"?) Then you can just install > > a single merged filter for all the native syscalls that starts with > > "if not native, reject"? > > We have that actually, it's this line you pasted above: > > SystemCallArchitectures=native > > It means: block all syscall ABIs but the native one for all processes > of this service. > > We currently use that setting only to synthesize an explicit seccomp > filter masking the other ABIs wholesale. We do not use it to suppress > generation of other, unrelated seccomp filters for that > arch. i.e. which means you might end up with one filter blocking x32 > wholesale, but then another unrelated option might install a filter > blocking some specific syscall with some specific arguments, but still > gets installed for x86-64 *and* i386 *and* x32. I guess we could > relatively easily tweak that and suppress the latter. If we did, then > on all services that set SystemCallArchitectures=native on x86-64 the > number of installed seccomp filters should become a third. Right, that's what I meant -- on x86_64 we've got way too many filters installed if we only care about "native" arch. ;) > > (Or better yet: make the default for filtering be "native only", and > > let services opt into other ABIs?) 
> That sounds like it would make people quite unhappy no? given that on > a systemd system anything that runs in userspace is ultimately part of > a service managed by systemd, if we'd default to "no native ABIs" this > would translate to "yeah, we entirely disable the i386 ABI for the > entire system unless you reconfigure it and/or opt-out your old i386 > services". > > Hence, on x86-64, I figure just masking i386 entirely is a bit too > drastic a compat breakage for us, no? Masking x32 otoh sounds like a > safe default to do without breaking too much compat given that x32 is > on its way out. Well, I meant "if seccomp filters get generated, default to native ABI". Right now, it seems most things running from systemd with seccomp filters are daemons, not user processes? (e.g. ssh.service, getty@.service, etc. have no filtering attached.) -- Kees Cook ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: new seccomp mode aims to improve performance 2020-06-02 12:44 ` Lennart Poettering 2020-06-02 18:37 ` Kees Cook @ 2020-06-16 6:00 ` Kees Cook 1 sibling, 0 replies; 24+ messages in thread From: Kees Cook @ 2020-06-16 6:00 UTC (permalink / raw) To: Lennart Poettering Cc: Alexei Starovoitov, zhujianwei (C), bpf, linux-security-module, Hehuazhen, Christian Ehrhardt, Zbigniew Jędrzejewski-Szmek On Tue, Jun 02, 2020 at 02:44:31PM +0200, Lennart Poettering wrote: > We have that actually, it's this line you pasted above: > > SystemCallArchitectures=native > > It means: block all syscall ABIs but the native one for all processes > of this service. Gotcha. And I see this now as I'm working on the code to generate bitmaps automatically. After systemd-resolved applies the 26th filter, the "compat" bitmap goes from effectively a duplicate of the native syscall map to blocking everything.

after filter 25:
...
[    5.405296] seccomp: syscall bitmap: compat 0-4: SECCOMP_RET_ALLOW
[    5.405297] seccomp: syscall bitmap: compat 5: filter
...
[    5.405326] seccomp: syscall bitmap: compat 380: filter
[    5.405327] seccomp: syscall bitmap: compat 381-439: SECCOMP_RET_ALLOW

after filter 26:
...
[    5.405498] seccomp: syscall bitmap: compat 0-439: SECCOMP_RET_KILL_THREAD

So that seems to be working as expected. :) -- Kees Cook ^ permalink raw reply [flat|nested] 24+ messages in thread
end of thread, other threads:[~2020-06-16 6:00 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-29 12:48 new seccomp mode aims to improve performance zhujianwei (C)
2020-05-29 15:43 ` Alexei Starovoitov
2020-05-29 16:09 ` Kees Cook
2020-05-29 17:31 ` Alexei Starovoitov
2020-05-29 19:27 ` Kees Cook
2020-05-31 17:19 ` Alexei Starovoitov
2020-06-01 18:16 ` Kees Cook
2020-06-01 2:08 ` Re: " zhujianwei (C)
2020-06-01 3:30 ` Alexei Starovoitov
2020-06-02 2:42 ` Re: " zhujianwei (C)
2020-06-02 3:24 ` Alexei Starovoitov
2020-06-02 11:13 ` Re: " zhujianwei (C)
2020-06-02 11:34 ` zhujianwei (C)
2020-06-02 18:32 ` Kees Cook
2020-06-03 4:51 ` Re: " zhujianwei (C)
2020-06-01 10:11 ` Lennart Poettering
2020-06-01 12:32 ` Paul Moore
2020-06-02 12:53 ` Lennart Poettering
2020-06-02 15:03 ` Paul Moore
2020-06-02 18:39 ` Kees Cook
2020-06-01 18:21 ` Kees Cook
2020-06-02 12:44 ` Lennart Poettering
2020-06-02 18:37 ` Kees Cook
2020-06-16 6:00 ` Kees Cook