* [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls
@ 2020-09-21  5:35 YiFei Zhu
  2020-09-21  5:35 ` [RFC PATCH seccomp 1/2] seccomp/cache: Add "emulator" to check if filter is arg-dependent YiFei Zhu
  ` (8 more replies)
  0 siblings, 9 replies; 149+ messages in thread
From: YiFei Zhu @ 2020-09-21  5:35 UTC (permalink / raw)
To: containers
Cc: YiFei Zhu, bpf, Andrea Arcangeli, Dimitrios Skarlatos,
    Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas,
    Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Valentin Rothberg

From: YiFei Zhu <yifeifz2@illinois.edu>

This series adds a bitmap to cache seccomp filter results when the
result allows a syscall and is independent of the syscall arguments.
This visibly decreases seccomp overhead for the most common seccomp
filters, with very little memory footprint.

The overhead of running seccomp filters has been part of several past
discussions [1][2][3]. Oftentimes, the filters have a large number of
instructions that check syscall numbers one by one and jump based on
that. Some users chain BPF filters, which further enlarges the
overhead. A recent work [6] comprehensively measures the seccomp
overhead and shows that it is non-negligible and has a non-trivial
impact on application performance.

We propose SECCOMP_CACHE, a cache-based solution to minimize the
seccomp overhead. The basic idea is to cache the result of each
syscall check to save the overhead of executing the filters on
subsequent calls. This is feasible because the check in seccomp is
stateless: the result for the same syscall ID and arguments remains
the same. We observed that some common filters, such as docker's [4]
or systemd's [5], make most decisions based only on the syscall
numbers, and, as past discussions considered, a bitmap where each bit
represents a syscall makes the most sense for these filters.

In the past, Kees proposed [2] to have an "add this syscall to the
reject bitmask" interface.
It is indeed much easier to securely build a reject accelerator that
pre-filters syscalls before they are passed to the BPF filters, since
it can only strengthen the security the filter provides. However,
filter rejections are ultimately an exceptional / rare case. Here,
instead of accelerating what is rejected, we accelerate what is
allowed. In order not to compromise the security rules the BPF filters
define, any accept-side accelerator must complement the BPF filters
rather than replace them.

Statically analyzing BPF bytecode to determine whether each syscall
will always land in allow or reject is more of a rabbit hole,
especially since there is no current in-kernel infrastructure to
enumerate all the possible architecture numbers for a given machine.
So rather than doing that, we propose to cache the results after the
BPF filters are run. And since there are filters, like docker's, that
check the arguments of some syscalls but not others, when a filter is
loaded we analyze it by following its control flow graph to find
whether each syscall is cacheable (i.e. the filter does not access the
syscall arguments or instruction pointer for it), and store the result
for each filter in a bitmap. Changes to the architecture number or the
filter are expected to be rare and simply cause the cache to be
cleared. This solution is fully transparent to userspace.

Ongoing work is to further support arguments with fast hash table
lookups. We are investigating the performance of doing so [6], and how
to best integrate it with the existing seccomp infrastructure.

We have done some benchmarks with the patches applied against bpf-next
commit 2e80be60c465 ("libbpf: Fix compilation warnings for 64-bit
printf args"). Measured by me, in a qemu-kvm x86_64 VM, on an Intel(R)
Core(TM) i5-8250U CPU @ 1.60GHz, average results:

Without cache, seccomp_benchmark:

Current BPF sysctl settings:
net.core.bpf_jit_enable = 1
net.core.bpf_jit_harden = 0
Calibrating sample size for 15 seconds worth of syscalls ...
Benchmarking 23486415 syscalls...
16.079642020 - 1.013345439 = 15066296581 (15.1s)
getpid native: 641 ns
32.080237410 - 16.080763500 = 15999473910 (16.0s)
getpid RET_ALLOW 1 filter: 681 ns
48.609461618 - 32.081296173 = 16528165445 (16.5s)
getpid RET_ALLOW 2 filters: 703 ns
Estimated total seccomp overhead for 1 filter: 40 ns
Estimated total seccomp overhead for 2 filters: 62 ns
Estimated seccomp per-filter overhead: 22 ns
Estimated seccomp entry overhead: 18 ns

With cache:

Current BPF sysctl settings:
net.core.bpf_jit_enable = 1
net.core.bpf_jit_harden = 0
Calibrating sample size for 15 seconds worth of syscalls ...
Benchmarking 23486415 syscalls...
16.059512499 - 1.014108434 = 15045404065 (15.0s)
getpid native: 640 ns
31.651075934 - 16.060637323 = 15590438611 (15.6s)
getpid RET_ALLOW 1 filter: 663 ns
47.367316169 - 31.652302661 = 15715013508 (15.7s)
getpid RET_ALLOW 2 filters: 669 ns
Estimated total seccomp overhead for 1 filter: 23 ns
Estimated total seccomp overhead for 2 filters: 29 ns
Estimated seccomp per-filter overhead: 6 ns
Estimated seccomp entry overhead: 17 ns

Depending on the run, the estimated seccomp overhead for 2 filters can
be less than the overhead for 1 filter, resulting in an underflow in
the estimated per-filter overhead:

Estimated total seccomp overhead for 1 filter: 27 ns
Estimated total seccomp overhead for 2 filters: 21 ns
Estimated seccomp per-filter overhead: 18446744073709551610 ns
Estimated seccomp entry overhead: 33 ns

Jack Chen has also run some benchmarks on a bare-metal Intel(R)
Xeon(R) CPU E3-1240 v3 @ 3.40GHz, with side channel mitigations off
(spec_store_bypass_disable=off spectre_v2=off mds=off pti=off
l1tf=off), with BPF JIT on and the docker default profile, and
reported:

unixbench syscall mix (https://github.com/kdlucas/byte-unixbench)
unconfined: 33295685
docker default: 20661056 60%
docker default + cache: 25719937 30%

Patch 1 introduces the static analyzer to check, for a given filter,
whether the CFG loads the syscall
arguments for each syscall number. Patch 2 implements the bitmap
cache.

[1] https://lore.kernel.org/linux-security-module/c22a6c3cefc2412cad00ae14c1371711@huawei.com/T/
[2] https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/T/
[3] https://github.com/seccomp/libseccomp/issues/116
[4] https://github.com/moby/moby/blob/ae0ef82b90356ac613f329a8ef5ee42ca923417d/profiles/seccomp/default.json
[5] https://github.com/systemd/systemd/blob/6743a1caf4037f03dc51a1277855018e4ab61957/src/shared/seccomp-util.c#L270
[6] Draco: Architectural and Operating System Support for System Call
    Security. https://tianyin.github.io/pub/draco.pdf, MICRO-53, Oct. 2020

YiFei Zhu (2):
  seccomp/cache: Add "emulator" to check if filter is arg-dependent
  seccomp/cache: Cache filter results that allow syscalls

 arch/x86/Kconfig        |  27 +++
 include/linux/seccomp.h |  22 +++
 kernel/seccomp.c        | 400 +++++++++++++++++++++++++++++++++++++++-
 3 files changed, 446 insertions(+), 3 deletions(-)

--
2.28.0

^ permalink raw reply	[flat|nested] 149+ messages in thread
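The allow-bitmap fast path the cover letter describes can be sketched in
userspace C roughly as follows. This is a minimal sketch under stated
assumptions: `toy_filter_cache`, the helper names, and the syscall count
are illustrative, not the kernel's actual names or layout; the real
cache lives in `struct seccomp_filter` and is consulted on syscall entry
before any cBPF program runs.

```c
#include <stdbool.h>
#include <string.h>

/* Illustrative syscall count; the real NR_syscalls is per-architecture. */
#define TOY_NR_SYSCALLS 448
#define TOY_BPL         (8 * (int)sizeof(unsigned long))

/* Per-filter cache: bit nr set => the filter always allows syscall nr. */
struct toy_filter_cache {
	unsigned long allow[(TOY_NR_SYSCALLS + TOY_BPL - 1) / TOY_BPL];
};

/* Fast path: if the bit is set, the cBPF filter need not run at all. */
static bool toy_cache_allows(const struct toy_filter_cache *c, int nr)
{
	if (nr < 0 || nr >= TOY_NR_SYSCALLS)
		return false; /* out of range: always take the slow path */
	return (c->allow[nr / TOY_BPL] >> (nr % TOY_BPL)) & 1;
}

/* Filled in at filter-load time by the static analysis of patch 1. */
static void toy_cache_mark_allowed(struct toy_filter_cache *c, int nr)
{
	if (nr < 0 || nr >= TOY_NR_SYSCALLS)
		return;
	c->allow[nr / TOY_BPL] |= 1UL << (nr % TOY_BPL);
}
```

A bit that is clear only means "run the filters as before", so a
too-conservative analysis costs performance but never correctness.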
* [RFC PATCH seccomp 1/2] seccomp/cache: Add "emulator" to check if filter is arg-dependent 2020-09-21 5:35 [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls YiFei Zhu @ 2020-09-21 5:35 ` YiFei Zhu 2020-09-21 17:47 ` Jann Horn 2020-09-21 5:35 ` [RFC PATCH seccomp 2/2] seccomp/cache: Cache filter results that allow syscalls YiFei Zhu ` (7 subsequent siblings) 8 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-09-21 5:35 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, Andrea Arcangeli, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Valentin Rothberg From: YiFei Zhu <yifeifz2@illinois.edu> SECCOMP_CACHE_NR_ONLY will only operate on syscalls that do not access any syscall arguments or instruction pointer. To facilitate this we need a static analyser to know whether a filter will access. This is implemented here with a pseudo-emulator, and stored in a per-filter bitmap. Each seccomp cBPF instruction, aside from ALU (which should rarely be used in seccomp), gets a naive best-effort emulation for each syscall number. The emulator works by following all possible (without SAT solving) paths the filter can take. Every cBPF register / memory position records whether that is a constant, and of so, the value of the constant. Loading from struct seccomp_data is considered constant if it is a syscall number, else it is an unknown. For each conditional jump, if the both arguments can be resolved to a constant, the jump is followed after computing the result of the condition; else both directions are followed, by pushing one of the next states to a linked list of next states to process. We keep a finite number of pending states to process. The emulation is halted if it reaches a return, or if it reaches a read from struct seccomp_data that reads an offset that is neither syscall number or architecture number. 
In the latter case, we mark the syscall number as not okay for seccomp to cache. If a filter depends on more filters, then if its dependee cannot process the syscall then the depender is also marked not to process the syscall. We also do a single pass on the entire filter instructions before performing emulation. If none of the filter instructions load from the troublesome offsets, then the filter is considered "trivial", and all syscalls are marked okay for seccomp to cache. Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> --- arch/x86/Kconfig | 27 ++++ kernel/seccomp.c | 323 ++++++++++++++++++++++++++++++++++++++++++++++- 2 files changed, 349 insertions(+), 1 deletion(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 7101ac64bb20..9e6891812053 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -1984,6 +1984,33 @@ config SECCOMP If unsure, say Y. Only embedded should say N here. +choice + prompt "Seccomp filter cache" + default SECCOMP_CACHE_NONE + depends on SECCOMP + depends on SECCOMP_FILTER + help + Seccomp filters can potentially incur large overhead for each + system call. This can alleviate some of the overhead. + + If in doubt, select 'none'. + +config SECCOMP_CACHE_NONE + bool "None" + help + No caching is done. Seccomp filters will be called each time + a system call occurs in a seccomp-guarded task. + +config SECCOMP_CACHE_NR_ONLY + bool "Syscall number only" + help + This is enables a bitmap to cache the results of seccomp + filters, if the filter allows the syscall and is independent + of the syscall arguments. This requires around 60 bytes per + filter and 70 bytes per task. 
+ +endchoice + source "kernel/Kconfig.hz" config KEXEC diff --git a/kernel/seccomp.c b/kernel/seccomp.c index 3ee59ce0a323..d8c30901face 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -143,6 +143,27 @@ struct notification { struct list_head notifications; }; +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY +/** + * struct seccomp_cache_filter_data - container for cache's per-filter data + * + * @syscall_ok: A bitmap where each bit represent whether seccomp is allowed to + * cache the results of this syscall. + */ +struct seccomp_cache_filter_data { + DECLARE_BITMAP(syscall_ok, NR_syscalls); +}; + +#define SECCOMP_EMU_MAX_PENDING_STATES 64 +#else +struct seccomp_cache_filter_data { }; + +static inline int seccomp_cache_prepare(struct seccomp_filter *sfilter) +{ + return 0; +} +#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */ + /** * struct seccomp_filter - container for seccomp BPF programs * @@ -185,6 +206,7 @@ struct seccomp_filter { struct notification *notif; struct mutex notify_lock; wait_queue_head_t wqh; + struct seccomp_cache_filter_data cache; }; /* Limit any path through the tree to 256KB worth of instructions. */ @@ -530,6 +552,297 @@ static inline void seccomp_sync_threads(unsigned long flags) } } +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY +/** + * struct seccomp_emu_env - container for seccomp emulator environment + * + * @filter: The cBPF filter instructions. + * @next_state: The next pending state to start emulating from. + * @next_state_len: Length of the next state linked list. This is used to + * enforce naximum number of pending states. + * @nr: The syscall number we are emulating. + * @syscall_ok: Emulation result, whether it is okay for seccomp to cache the + * syscall. + */ +struct seccomp_emu_env { + struct sock_filter *filter; + struct seccomp_emu_state *next_state; + int next_state_len; + int nr; + bool syscall_ok; +}; + +/** + * struct seccomp_emu_state - container for seccomp emulator state + * + * @next: The next pending state. 
This structure is a linked list. + * @pc: The current program counter. + * @reg_known: Whether each cBPF register / memory location is a constant. + * @reg_const: When a cBPF register / memory location is a constant, the value + * of that constant. + */ +struct seccomp_emu_state { + struct seccomp_emu_state *next; + int pc; + bool reg_known[2 + BPF_MEMWORDS]; + u32 reg_const[2 + BPF_MEMWORDS]; +}; + +/** + * seccomp_emu_step - step one instruction in the emulator + * @env: The emulator environment + * @state: The emulator state + * + * Returns 1 to halt emulation, 0 to continue, or -errno if error occurred. + */ +static int seccomp_emu_step(struct seccomp_emu_env *env, + struct seccomp_emu_state *state) +{ + struct sock_filter *ftest = &env->filter[state->pc++]; + struct seccomp_emu_state *new_state; + u16 code = ftest->code; + u32 k = ftest->k; + u32 operand; + bool compare; + int reg_idx; + + switch (BPF_CLASS(code)) { + case BPF_LD: + case BPF_LDX: + reg_idx = BPF_CLASS(code) == BPF_LDX; + + switch (BPF_MODE(code)) { + case BPF_IMM: + state->reg_known[reg_idx] = true; + state->reg_const[reg_idx] = k; + break; + case BPF_ABS: + if (k == offsetof(struct seccomp_data, nr)) { + state->reg_known[reg_idx] = true; + state->reg_const[reg_idx] = env->nr; + } else { + state->reg_known[reg_idx] = false; + + if (k != offsetof(struct seccomp_data, arch)) { + env->syscall_ok = false; + return 1; + } + } + + break; + case BPF_MEM: + state->reg_known[reg_idx] = state->reg_known[2 + k]; + state->reg_const[reg_idx] = state->reg_const[2 + k]; + break; + default: + state->reg_known[reg_idx] = false; + } + + return 0; + case BPF_ST: + case BPF_STX: + reg_idx = BPF_CLASS(code) == BPF_STX; + + state->reg_known[2 + k] = state->reg_known[reg_idx]; + state->reg_const[2 + k] = state->reg_const[reg_idx]; + + return 0; + case BPF_ALU: + state->reg_known[0] = false; + return 0; + case BPF_JMP: + if (BPF_OP(code) == BPF_JA) { + state->pc += k; + return 0; + } + + if (ftest->jt == ftest->jf) { 
+ state->pc += ftest->jt; + return 0; + } + + if (!state->reg_known[0]) + goto both_cases; + + switch (BPF_SRC(code)) { + case BPF_K: + operand = k; + break; + case BPF_X: + if (!state->reg_known[1]) + goto both_cases; + operand = state->reg_const[1]; + break; + default: + WARN_ON(true); + return -EINVAL; + } + + switch (BPF_OP(code)) { + case BPF_JEQ: + compare = state->reg_const[0] == operand; + break; + case BPF_JGT: + compare = state->reg_const[0] > operand; + break; + case BPF_JGE: + compare = state->reg_const[0] >= operand; + break; + case BPF_JSET: + compare = state->reg_const[0] & operand; + break; + default: + WARN_ON(true); + return -EINVAL; + } + + state->pc += compare ? ftest->jt : ftest->jf; + + return 0; + +both_cases: + if (env->next_state_len >= SECCOMP_EMU_MAX_PENDING_STATES) + return -E2BIG; + + new_state = kmalloc(sizeof(*new_state), GFP_KERNEL); + if (!new_state) + return -ENOMEM; + + *new_state = *state; + new_state->next = env->next_state; + new_state->pc += ftest->jt; + env->next_state = new_state; + env->next_state_len++; + + state->pc += ftest->jf; + + return 0; + case BPF_RET: + return 1; + case BPF_MISC: + switch (BPF_MISCOP(code)) { + case BPF_TAX: + state->reg_known[1] = state->reg_known[0]; + state->reg_const[1] = state->reg_const[0]; + break; + case BPF_TXA: + state->reg_known[0] = state->reg_known[1]; + state->reg_const[0] = state->reg_const[1]; + break; + default: + WARN_ON(true); + return -EINVAL; + } + + return 0; + default: + BUILD_BUG(); + unreachable(); + } +} + +/** + * seccomp_cache_filter_trivial - check if the program does not load arguments. + * @fprog: The cBPF program code + * + * Returns true if the filter is trivial. 
+ */ +static bool seccomp_cache_filter_trivial(struct sock_fprog_kern *fprog) +{ + int pc; + + for (pc = 0; pc < fprog->len; pc++) { + struct sock_filter *ftest = &fprog->filter[pc]; + u16 code = ftest->code; + u32 k = ftest->k; + + if (BPF_CLASS(code) == BPF_LD && BPF_MODE(code) == BPF_ABS) { + if (k != offsetof(struct seccomp_data, nr) && + k != offsetof(struct seccomp_data, arch)) + return false; + } + } + + return true; +} + +/** + * seccomp_cache_prepare - emulate the filter to find cachable syscalls + * @sfilter: The seccomp filter + * + * Returns 0 if successful or -errno if error occurred. + */ +static int seccomp_cache_prepare(struct seccomp_filter *sfilter) +{ + struct sock_fprog_kern *fprog = sfilter->prog->orig_prog; + struct seccomp_filter *prev = sfilter->prev; + struct sock_filter *filter = fprog->filter; + struct seccomp_emu_state *state; + int nr, res = 0; + + if (seccomp_cache_filter_trivial(fprog)) { + if (prev) + bitmap_copy(sfilter->cache.syscall_ok, + prev->cache.syscall_ok, NR_syscalls); + else + bitmap_fill(sfilter->cache.syscall_ok, NR_syscalls); + + return 0; + } + + for (nr = 0; nr < NR_syscalls; nr++) { + struct seccomp_emu_env env = {0}; + + env.syscall_ok = !prev || test_bit(nr, prev->cache.syscall_ok); + if (!env.syscall_ok) + continue; + + env.filter = filter; + env.nr = nr; + + env.next_state = kzalloc(sizeof(*env.next_state), GFP_KERNEL); + env.next_state_len = 1; + if (!env.next_state) + return -ENOMEM; + + while (env.next_state) { + state = env.next_state; + env.next_state = state->next; + env.next_state_len--; + + while (true) { + res = seccomp_emu_step(&env, state); + + if (res) + break; + } + + kfree(state); + + if (res < 0) + goto free_states; + } + +free_states: + while (env.next_state) { + state = env.next_state; + env.next_state = state->next; + + kfree(state); + } + + if (res < 0) + goto out; + + if (env.syscall_ok) + set_bit(nr, sfilter->cache.syscall_ok); + } + +out: + return res; +} +#endif /* 
CONFIG_SECCOMP_CACHE_NR_ONLY */ + /** * seccomp_prepare_filter: Prepares a seccomp filter for use. * @fprog: BPF program to install @@ -540,7 +853,8 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog) { struct seccomp_filter *sfilter; int ret; - const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE); + const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE) || + IS_ENABLED(CONFIG_SECCOMP_CACHE_NR_ONLY); if (fprog->len == 0 || fprog->len > BPF_MAXINSNS) return ERR_PTR(-EINVAL); @@ -571,6 +885,13 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog) return ERR_PTR(ret); } + ret = seccomp_cache_prepare(sfilter); + if (ret < 0) { + bpf_prog_destroy(sfilter->prog); + kfree(sfilter); + return ERR_PTR(ret); + } + refcount_set(&sfilter->refs, 1); refcount_set(&sfilter->users, 1); init_waitqueue_head(&sfilter->wqh); -- 2.28.0 ^ permalink raw reply related [flat|nested] 149+ messages in thread
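The single-pass "trivial filter" check from patch 1 can be illustrated
with a small userspace sketch. The `toy_*` names are made up for
illustration; the kernel code operates on `struct sock_filter` with the
real `BPF_*` macros, but the opcode encodings and the `seccomp_data`
offsets below match the UAPI values.

```c
#include <stdbool.h>
#include <stdint.h>

/* Minimal cBPF definitions, mirroring <linux/bpf_common.h>. */
struct toy_sock_filter {
	uint16_t code;
	uint8_t jt, jf;
	uint32_t k;
};

#define TOY_BPF_CLASS(code) ((code) & 0x07)
#define TOY_BPF_MODE(code)  ((code) & 0xe0)
#define TOY_BPF_LD   0x00
#define TOY_BPF_ABS  0x20

/* UAPI-stable offsets into struct seccomp_data. */
#define TOY_SD_NR    0
#define TOY_SD_ARCH  4

/*
 * Single pass over the program: if no instruction loads a seccomp_data
 * field other than the syscall number or the arch, the filter's result
 * depends only on (arch, nr), so every allowed syscall is cacheable
 * without running the per-syscall emulation at all.
 */
static bool toy_filter_is_trivial(const struct toy_sock_filter *f, int len)
{
	for (int pc = 0; pc < len; pc++) {
		uint16_t code = f[pc].code;

		if (TOY_BPF_CLASS(code) == TOY_BPF_LD &&
		    TOY_BPF_MODE(code) == TOY_BPF_ABS &&
		    f[pc].k != TOY_SD_NR && f[pc].k != TOY_SD_ARCH)
			return false;
	}
	return true;
}
```

For example, a two-instruction filter `ld [0]; ret #k` is trivial, while
one beginning `ld [16]` (the low word of `args[0]`) is not.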
* Re: [RFC PATCH seccomp 1/2] seccomp/cache: Add "emulator" to check if filter is arg-dependent 2020-09-21 5:35 ` [RFC PATCH seccomp 1/2] seccomp/cache: Add "emulator" to check if filter is arg-dependent YiFei Zhu @ 2020-09-21 17:47 ` Jann Horn 2020-09-21 18:38 ` Jann Horn 2020-09-21 23:44 ` YiFei Zhu 0 siblings, 2 replies; 149+ messages in thread From: Jann Horn @ 2020-09-21 17:47 UTC (permalink / raw) To: YiFei Zhu Cc: Linux Containers, YiFei Zhu, bpf, Andrea Arcangeli, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Valentin Rothberg, Andy Lutomirski, Will Drewry, Jann Horn, Aleksa Sarai, kernel list On Mon, Sep 21, 2020 at 7:35 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > SECCOMP_CACHE_NR_ONLY will only operate on syscalls that do not > access any syscall arguments or instruction pointer. To facilitate > this we need a static analyser to know whether a filter will > access. This is implemented here with a pseudo-emulator, and > stored in a per-filter bitmap. Each seccomp cBPF instruction, > aside from ALU (which should rarely be used in seccomp), gets a > naive best-effort emulation for each syscall number. > > The emulator works by following all possible (without SAT solving) > paths the filter can take. Every cBPF register / memory position > records whether that is a constant, and of so, the value of the > constant. Loading from struct seccomp_data is considered constant > if it is a syscall number, else it is an unknown. For each > conditional jump, if the both arguments can be resolved to a > constant, the jump is followed after computing the result of the > condition; else both directions are followed, by pushing one of > the next states to a linked list of next states to process. We > keep a finite number of pending states to process. Is this actually necessary, or can we just bail out on any branch that we can't statically resolve? 
struct seccomp_data only contains the syscall number (constant for a given filter evaluation), the architecture number (also constant), the instruction pointer (basically never used in seccomp filters), and the syscall arguments. Any normal seccomp filter first branches on the architecture, then branches on the syscall number, and then branches on arguments if necessary. This optimization could only be improved by the "follow both branches" logic if a seccomp program branches on either the instruction pointer or an argument *before* looking at the syscall number, and later comes to the same conclusion on *both* sides of the check. It would have to be something like: if (instruction_pointer == 0xasdf1234) { if (nr == mmap) return ACCEPT; [...] return KILL; } else { if (nr == mmap) return ACCEPT; [...] return KILL; } I've never seen anyone do something like this. And the proposed patch would still bail out on such a filter because of the load from the instruction_pointer field; I don't think it would even be possible to reach a branch with an unknown condition with this patch. So I think we should probably get rid of this extra logic for keeping track of multiple execution states for now. That would make the code a lot simpler. Also: If it turns out that the time spent in seccomp_cache_prepare() is measurable for large filters, a possible improvement would be to keep track of the last syscall number for which the result would be the same as for the current one, such that instead of evaluating the filter for one instruction at a time, it would effectively be evaluated for a range at a time. That should be pretty straightforward to implement, I think. > The emulation is halted if it reaches a return, or if it reaches a > read from struct seccomp_data that reads an offset that is neither > syscall number or architecture number. In the latter case, we mark > the syscall number as not okay for seccomp to cache. 
If a filter > depends on more filters, then if its dependee cannot process the > syscall then the depender is also marked not to process the syscall. > > We also do a single pass on the entire filter instructions before > performing emulation. If none of the filter instructions load from > the troublesome offsets, then the filter is considered "trivial", > and all syscalls are marked okay for seccomp to cache. > > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> > --- > arch/x86/Kconfig | 27 ++++ > kernel/seccomp.c | 323 ++++++++++++++++++++++++++++++++++++++++++++++- > 2 files changed, 349 insertions(+), 1 deletion(-) > > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig [...] > +choice > + prompt "Seccomp filter cache" > + default SECCOMP_CACHE_NONE I think this should be on by default. > + depends on SECCOMP > + depends on SECCOMP_FILTER SECCOMP_FILTER already depends on SECCOMP, so the "depends on SECCOMP" line is unnecessary. > + help > + Seccomp filters can potentially incur large overhead for each > + system call. This can alleviate some of the overhead. > + > + If in doubt, select 'none'. This should not be in arch/x86. Other architectures, such as arm64, should also be able to use this without extra work. > +config SECCOMP_CACHE_NONE > + bool "None" > + help > + No caching is done. Seccomp filters will be called each time > + a system call occurs in a seccomp-guarded task. > + > +config SECCOMP_CACHE_NR_ONLY > + bool "Syscall number only" > + help > + This is enables a bitmap to cache the results of seccomp > + filters, if the filter allows the syscall and is independent > + of the syscall arguments. Maybe reword this as something like: "For each syscall number, if the seccomp filter has a fixed result, store that result in a bitmap to speed up system calls." > This requires around 60 bytes per > + filter and 70 bytes per task. 
> + > +endchoice > + > source "kernel/Kconfig.hz" > > config KEXEC > diff --git a/kernel/seccomp.c b/kernel/seccomp.c > index 3ee59ce0a323..d8c30901face 100644 > --- a/kernel/seccomp.c > +++ b/kernel/seccomp.c > @@ -143,6 +143,27 @@ struct notification { > struct list_head notifications; > }; > > +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY > +/** > + * struct seccomp_cache_filter_data - container for cache's per-filter data > + * > + * @syscall_ok: A bitmap where each bit represent whether seccomp is allowed to nit: represents > + * cache the results of this syscall. > + */ > +struct seccomp_cache_filter_data { > + DECLARE_BITMAP(syscall_ok, NR_syscalls); > +}; > + > +#define SECCOMP_EMU_MAX_PENDING_STATES 64 > +#else > +struct seccomp_cache_filter_data { }; > + > +static inline int seccomp_cache_prepare(struct seccomp_filter *sfilter) > +{ > + return 0; > +} > +#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */ [...] > +/** > + * seccomp_emu_step - step one instruction in the emulator > + * @env: The emulator environment > + * @state: The emulator state > + * > + * Returns 1 to halt emulation, 0 to continue, or -errno if error occurred. > + */ > +static int seccomp_emu_step(struct seccomp_emu_env *env, > + struct seccomp_emu_state *state) > +{ > + struct sock_filter *ftest = &env->filter[state->pc++]; > + struct seccomp_emu_state *new_state; > + u16 code = ftest->code; > + u32 k = ftest->k; > + u32 operand; > + bool compare; > + int reg_idx; > + > + switch (BPF_CLASS(code)) { > + case BPF_LD: > + case BPF_LDX: > + reg_idx = BPF_CLASS(code) == BPF_LDX; > + > + switch (BPF_MODE(code)) { > + case BPF_IMM: > + state->reg_known[reg_idx] = true; > + state->reg_const[reg_idx] = k; > + break; > + case BPF_ABS: > + if (k == offsetof(struct seccomp_data, nr)) { > + state->reg_known[reg_idx] = true; > + state->reg_const[reg_idx] = env->nr; > + } else { > + state->reg_known[reg_idx] = false; This is completely broken. 
This emulation logic *needs* to run with the proper architecture identifier. (And for platforms like x86-64 that have compatibility support for a second ABI, the emulation should probably also be done for that ABI, and there should be separate bitmasks for that ABI.) With the current logic, you will (almost) never actually have permitted syscalls in the bitmask, because filters fundamentally have to return different results for different ABIs - the syscall numbers mean completely different things under different ABIs. > + if (k != offsetof(struct seccomp_data, arch)) { > + env->syscall_ok = false; > + return 1; > + } > + } This would read nicer as: if (k == offsetof(struct seccomp_data, nr)) { } else if (k == offsetof(struct seccomp_data, arch)) { } else { env->syscall_ok = false; return 1; } > + > + break; > + case BPF_MEM: > + state->reg_known[reg_idx] = state->reg_known[2 + k]; > + state->reg_const[reg_idx] = state->reg_const[2 + k]; > + break; > + default: > + state->reg_known[reg_idx] = false; > + } > + > + return 0; > + case BPF_ST: > + case BPF_STX: > + reg_idx = BPF_CLASS(code) == BPF_STX; > + > + state->reg_known[2 + k] = state->reg_known[reg_idx]; > + state->reg_const[2 + k] = state->reg_const[reg_idx]; I think we should probably just bail out if we see anything that's BPF_ST/BPF_STX. I've never seen seccomp filters that actually use that part of cBPF. But in case we do need this, maybe instead of using "2 +" for all these things, the cBPF memory slots should be in a separate array. > + return 0; > + case BPF_ALU: > + state->reg_known[0] = false; > + return 0; > + case BPF_JMP: > + if (BPF_OP(code) == BPF_JA) { > + state->pc += k; > + return 0; > + } > + > + if (ftest->jt == ftest->jf) { > + state->pc += ftest->jt; > + return 0; > + } Why is this check here? Is anyone actually creating filters with such obviously nonsensical branches? 
I know that there are highly ludicrous filters out there, but I don't think I've ever seen this specific kind of useless code. > + if (!state->reg_known[0]) > + goto both_cases; [...] > +both_cases: > + if (env->next_state_len >= SECCOMP_EMU_MAX_PENDING_STATES) > + return -E2BIG; Even if we cap the maximum number of pending states, this could still run for an almost unbounded amount of time, I think. Which is bad. If this code was actually necessary, we'd probably want to track separately the total number of branches we've seen and so on. But as I said, I think this code should just be removed instead. [...] > + } > +} [...] ^ permalink raw reply [flat|nested] 149+ messages in thread
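The simplification suggested above — evaluate the filter for a fixed
(arch, nr) pair and bail out as "not cacheable" on the first load that
cannot be resolved — could look roughly like this userspace sketch. It
supports only a toy opcode subset; the opcode encodings, the
`seccomp_data` offsets, `SECCOMP_RET_ALLOW`, and `AUDIT_ARCH_X86_64`
match the UAPI values, but everything else (names, the example filter)
is illustrative, not the kernel implementation.

```c
#include <stdint.h>

struct toy_insn {
	uint16_t code;
	uint8_t jt, jf;
	uint32_t k;
};

#define TOY_LD_ABS 0x20	/* BPF_LD | BPF_W | BPF_ABS */
#define TOY_JEQ_K  0x15	/* BPF_JMP | BPF_JEQ | BPF_K */
#define TOY_JA     0x05	/* BPF_JMP | BPF_JA */
#define TOY_RET_K  0x06	/* BPF_RET | BPF_K */

#define TOY_SD_NR    0	/* offsetof(struct seccomp_data, nr) */
#define TOY_SD_ARCH  4	/* offsetof(struct seccomp_data, arch) */

#define TOY_SECCOMP_RET_ALLOW 0x7fff0000u
#define TOY_AUDIT_ARCH_X86_64 0xc000003eu

/*
 * Evaluate the filter for a fixed (arch, nr).  Every loaded value is a
 * known constant, so no branch ever needs to be forked: the first load
 * of any other seccomp_data field, or any unsupported instruction,
 * simply bails out.  Returns the filter's u32 return value, or -1 for
 * "result not cacheable for this (arch, nr)".
 */
static int64_t toy_eval_fixed(const struct toy_insn *f, int len,
			      uint32_t arch, uint32_t nr)
{
	uint32_t acc = 0;

	for (int pc = 0; pc < len; pc++) {
		const struct toy_insn *i = &f[pc];

		switch (i->code) {
		case TOY_LD_ABS:
			if (i->k == TOY_SD_NR)
				acc = nr;
			else if (i->k == TOY_SD_ARCH)
				acc = arch;
			else
				return -1;	/* argument or IP load: bail */
			break;
		case TOY_JA:
			pc += i->k;
			break;
		case TOY_JEQ_K:
			pc += (acc == i->k) ? i->jt : i->jf;
			break;
		case TOY_RET_K:
			return i->k;
		default:
			return -1;		/* unsupported: bail */
		}
	}
	return -1;	/* fell off the end: treat as not cacheable */
}

/* Example: "allow getpid (nr 39 on x86_64), kill everything else". */
static const struct toy_insn toy_example[] = {
	{TOY_LD_ABS, 0, 0, TOY_SD_ARCH},
	{TOY_JEQ_K,  0, 3, TOY_AUDIT_ARCH_X86_64},
	{TOY_LD_ABS, 0, 0, TOY_SD_NR},
	{TOY_JEQ_K,  0, 1, 39},
	{TOY_RET_K,  0, 0, TOY_SECCOMP_RET_ALLOW},
	{TOY_RET_K,  0, 0, 0},	/* SECCOMP_RET_KILL_THREAD */
};
```

Running `toy_eval_fixed` once per (arch, nr) yields the allow bitmap
directly, with strictly bounded work per syscall number.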
* Re: [RFC PATCH seccomp 1/2] seccomp/cache: Add "emulator" to check if filter is arg-dependent 2020-09-21 17:47 ` Jann Horn @ 2020-09-21 18:38 ` Jann Horn 2020-09-21 23:44 ` YiFei Zhu 1 sibling, 0 replies; 149+ messages in thread From: Jann Horn @ 2020-09-21 18:38 UTC (permalink / raw) To: YiFei Zhu Cc: Linux Containers, YiFei Zhu, bpf, Andrea Arcangeli, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Valentin Rothberg, Andy Lutomirski, Will Drewry, Aleksa Sarai, kernel list On Mon, Sep 21, 2020 at 7:47 PM Jann Horn <jannh@google.com> wrote: > On Mon, Sep 21, 2020 at 7:35 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > > SECCOMP_CACHE_NR_ONLY will only operate on syscalls that do not > > access any syscall arguments or instruction pointer. To facilitate > > this we need a static analyser to know whether a filter will > > access. This is implemented here with a pseudo-emulator, and > > stored in a per-filter bitmap. Each seccomp cBPF instruction, > > aside from ALU (which should rarely be used in seccomp), gets a > > naive best-effort emulation for each syscall number. > > > > The emulator works by following all possible (without SAT solving) > > paths the filter can take. Every cBPF register / memory position > > records whether that is a constant, and of so, the value of the > > constant. Loading from struct seccomp_data is considered constant > > if it is a syscall number, else it is an unknown. For each > > conditional jump, if the both arguments can be resolved to a > > constant, the jump is followed after computing the result of the > > condition; else both directions are followed, by pushing one of > > the next states to a linked list of next states to process. We > > keep a finite number of pending states to process. > > Is this actually necessary, or can we just bail out on any branch that > we can't statically resolve? Aaaah, now I get what's going on. 
You statically compute a bitmask that says whether a given syscall number always has a fixed result *per architecture number*, and then use that later to decide whether results can be cached for the combination of a specific seccomp filter and a specific architecture number. Which mostly works, except that it means you end up with weird per-thread caches and you get interference between ABIs (so if a process e.g. filters the argument numbers for syscall 123 in ABI 1, the results for syscall 123 in ABI 2 also can't be cached). Anyway, even though this works, I think it's the wrong way to go about it. ^ permalink raw reply [flat|nested] 149+ messages in thread
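Separate per-ABI bitmaps, as suggested above, would remove the cross-ABI
interference: caching would be keyed by (arch, nr) rather than nr alone.
A sketch of that shape (all names and sizes illustrative; not the
eventual kernel implementation):

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_NR_SYSCALLS 448	/* illustrative */
#define TOY_MAX_ARCHES  2	/* e.g. native + compat on x86-64 */
#define TOY_BPL         (8 * (int)sizeof(unsigned long))

/* One allow-bitmap per ABI the filter can be entered with. */
struct toy_arch_cache {
	uint32_t arch;		/* AUDIT_ARCH_* identifier */
	unsigned long allow[(TOY_NR_SYSCALLS + TOY_BPL - 1) / TOY_BPL];
};

struct toy_cache {
	struct toy_arch_cache per_arch[TOY_MAX_ARCHES];
};

/* Keyed by (arch, nr): filtering the arguments of nr in one ABI no
 * longer prevents caching the same nr in the other ABI. */
static bool toy_arch_cache_allows(const struct toy_cache *c,
				  uint32_t arch, int nr)
{
	for (int i = 0; i < TOY_MAX_ARCHES; i++) {
		if (c->per_arch[i].arch != arch)
			continue;
		if (nr < 0 || nr >= TOY_NR_SYSCALLS)
			return false;
		return (c->per_arch[i].allow[nr / TOY_BPL] >>
			(nr % TOY_BPL)) & 1;
	}
	return false;	/* unknown ABI: slow path */
}

static void toy_arch_cache_mark(struct toy_cache *c, uint32_t arch, int nr)
{
	if (nr < 0 || nr >= TOY_NR_SYSCALLS)
		return;
	for (int i = 0; i < TOY_MAX_ARCHES; i++)
		if (c->per_arch[i].arch == arch)
			c->per_arch[i].allow[nr / TOY_BPL] |=
				1UL << (nr % TOY_BPL);
}
```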
* Re: [RFC PATCH seccomp 1/2] seccomp/cache: Add "emulator" to check if filter is arg-dependent 2020-09-21 17:47 ` Jann Horn 2020-09-21 18:38 ` Jann Horn @ 2020-09-21 23:44 ` YiFei Zhu 2020-09-22 0:25 ` Jann Horn 1 sibling, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-09-21 23:44 UTC (permalink / raw) To: Jann Horn Cc: Linux Containers, YiFei Zhu, bpf, Andrea Arcangeli, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Valentin Rothberg, Andy Lutomirski, Will Drewry, Aleksa Sarai, kernel list On Mon, Sep 21, 2020 at 12:47 PM Jann Horn <jannh@google.com> wrote: > Is this actually necessary, or can we just bail out on any branch that > we can't statically resolve? I think that would make much more sense after we enumerate the arch numbers, since if there is still a branch once the arch and syscall numbers are fixed, we can assume that the return values will differ depending on which side of the branch is followed. > Also: If it turns out that the time spent in seccomp_cache_prepare() > is measurable for large filters, a possible improvement would be to > keep track of the last syscall number for which the result would be > the same as for the current one, such that instead of evaluating the > filter for one instruction at a time, it would effectively be > evaluated for a range at a time. That should be pretty straightforward > to implement, I think. My concern was more about the possibly-exponential amount of time & memory needed to evaluate an adversarial filter full of unresolvable branches, hence the cap on pending states. If we never follow both branches then evaluation should not be much of a concern. > > + depends on SECCOMP > > + depends on SECCOMP_FILTER > > SECCOMP_FILTER already depends on SECCOMP, so the "depends on SECCOMP" > line is unnecessary. The reason this is here is how it looks in menuconfig. 
SECCOMP is the direct previous entry, so if this depends on SECCOMP then the config would be indented. Is this appearance not worth keeping, or is there some better way to do this? > > + help > > + Seccomp filters can potentially incur large overhead for each > > + system call. This can alleviate some of the overhead. > > + > > + If in doubt, select 'none'. > > This should not be in arch/x86. Other architectures, such as arm64, > should also be able to use this without extra work. In the initial RFC patch I only added it to x86. I could add it to any arch that has seccomp filters. Though, I'm wondering, why is SECCOMP in the arch-specific Kconfigs? > I think we should probably just bail out if we see anything that's > BPF_ST/BPF_STX. I've never seen seccomp filters that actually use that > part of cBPF. > > But in case we do need this, maybe instead of using "2 +" for all > these things, the cBPF memory slots should be in a separate array. OK, I'll just bail. YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [RFC PATCH seccomp 1/2] seccomp/cache: Add "emulator" to check if filter is arg-dependent 2020-09-21 23:44 ` YiFei Zhu @ 2020-09-22 0:25 ` Jann Horn 2020-09-22 0:47 ` YiFei Zhu 0 siblings, 1 reply; 149+ messages in thread From: Jann Horn @ 2020-09-22 0:25 UTC (permalink / raw) To: YiFei Zhu, Kees Cook Cc: Linux Containers, YiFei Zhu, bpf, Andrea Arcangeli, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Valentin Rothberg, Andy Lutomirski, Will Drewry, Aleksa Sarai, kernel list On Tue, Sep 22, 2020 at 1:44 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > On Mon, Sep 21, 2020 at 12:47 PM Jann Horn <jannh@google.com> wrote: > > > + depends on SECCOMP > > > + depends on SECCOMP_FILTER > > > > SECCOMP_FILTER already depends on SECCOMP, so the "depends on SECCOMP" > > line is unnecessary. > > The reason that this is here is because of the looks in menuconfig. > SECCOMP is the direct previous entry, so if this depends on SECCOMP > then the config would be indented. Is this looks not worth keeping or > is there some better way to do this? Ah, I didn't realize this. > > > + help > > > + Seccomp filters can potentially incur large overhead for each > > > + system call. This can alleviate some of the overhead. > > > + > > > + If in doubt, select 'none'. > > > > This should not be in arch/x86. Other architectures, such as arm64, > > should also be able to use this without extra work. > > In the initial RFC patch I only added to x86. I could add it to any > arch that has seccomp filters. Though, I'm wondering, why is SECCOMP > in the arch-specific Kconfigs? Ugh, yeah, the existing code is already bad... as far as I can tell, SECCOMP shouldn't be there, and instead the arch-specific Kconfig should define something like HAVE_ARCH_SECCOMP and then arch/Kconfig would define SECCOMP and let it depend on HAVE_ARCH_SECCOMP. 
It's really gross how the SECCOMP config description has been copypasted into a dozen different Kconfig files; and looking around a bit, you can actually see that e.g. s390 has an utterly outdated help text which still claims that seccomp is controlled via the ancient "/proc/<pid>/seccomp". I guess this very nicely illustrates why putting such options into arch-specific Kconfig is a bad idea. :P ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [RFC PATCH seccomp 1/2] seccomp/cache: Add "emulator" to check if filter is arg-dependent 2020-09-22 0:25 ` Jann Horn @ 2020-09-22 0:47 ` YiFei Zhu 0 siblings, 0 replies; 149+ messages in thread From: YiFei Zhu @ 2020-09-22 0:47 UTC (permalink / raw) To: Jann Horn Cc: Kees Cook, Linux Containers, YiFei Zhu, bpf, Andrea Arcangeli, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Valentin Rothberg, Andy Lutomirski, Will Drewry, Aleksa Sarai, kernel list On Mon, Sep 21, 2020 at 7:26 PM Jann Horn <jannh@google.com> wrote: > > In the initial RFC patch I only added to x86. I could add it to any > > arch that has seccomp filters. Though, I'm wondering, why is SECCOMP > > in the arch-specific Kconfigs? > > Ugh, yeah, the existing code is already bad... as far as I can tell, > SECCOMP shouldn't be there, and instead the arch-specific Kconfig > should define something like HAVE_ARCH_SECCOMP and then arch/Kconfig > would define SECCOMP and let it depend on HAVE_ARCH_SECCOMP. It's > really gross how the SECCOMP config description has been copypasted > into a dozen different Kconfig files; and looking around a bit, you > can actually see that e.g. s390 has an utterly outdated help text > which still claims that seccomp is controlled via the ancient > "/proc/<pid>/seccomp". I guess this very nicely illustrates why > putting such options into arch-specific Kconfig is a bad idea. :P Ah, time to fix this then. YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
* [RFC PATCH seccomp 2/2] seccomp/cache: Cache filter results that allow syscalls 2020-09-21 5:35 [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls YiFei Zhu 2020-09-21 5:35 ` [RFC PATCH seccomp 1/2] seccomp/cache: Add "emulator" to check if filter is arg-dependent YiFei Zhu @ 2020-09-21 5:35 ` YiFei Zhu 2020-09-21 18:08 ` Jann Horn 2020-09-25 0:01 ` [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array Kees Cook 2020-09-21 5:48 ` [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls Sargun Dhillon ` (6 subsequent siblings) 8 siblings, 2 replies; 149+ messages in thread From: YiFei Zhu @ 2020-09-21 5:35 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, Andrea Arcangeli, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Valentin Rothberg From: YiFei Zhu <yifeifz2@illinois.edu> The fast (common) path for seccomp should be that the filter permits the syscall to pass through, and failing seccomp is expected to be an exceptional case; it is not expected for userspace to call a denylisted syscall over and over. We do this by creating a per-task bitmap of permitted syscalls. If the seccomp filter is invoked we check if it is cached and if so directly return allow. Else we call into the cBPF filter, and if the result is an allow then we cache the results. The cache is per-task to minimize thread-synchronization issues in the hot path of cache lookup, and to avoid different architecture numbers sharing the same cache. To account for one thread changing the filter for another thread of the same process, the per-task struct also contains a pointer to the filter the cache is built on. When the cache lookup uses a different filter than the last lookup, the per-task cache bitmap is cleared. 
An architecture number change also clears the per-task cache, since it should be very unlikely for a given thread to change its architecture. Benchmark results, on qemu-kvm x86_64 VM, on Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz, with seccomp_benchmark: With SECCOMP_CACHE_NONE: Current BPF sysctl settings: net.core.bpf_jit_enable = 1 net.core.bpf_jit_harden = 0 Calibrating sample size for 15 seconds worth of syscalls ... Benchmarking 23486415 syscalls... 16.079642020 - 1.013345439 = 15066296581 (15.1s) getpid native: 641 ns 32.080237410 - 16.080763500 = 15999473910 (16.0s) getpid RET_ALLOW 1 filter: 681 ns 48.609461618 - 32.081296173 = 16528165445 (16.5s) getpid RET_ALLOW 2 filters: 703 ns Estimated total seccomp overhead for 1 filter: 40 ns Estimated total seccomp overhead for 2 filters: 62 ns Estimated seccomp per-filter overhead: 22 ns Estimated seccomp entry overhead: 18 ns With SECCOMP_CACHE_NR_ONLY: Current BPF sysctl settings: net.core.bpf_jit_enable = 1 net.core.bpf_jit_harden = 0 Calibrating sample size for 15 seconds worth of syscalls ... Benchmarking 23486415 syscalls... 
16.059512499 - 1.014108434 = 15045404065 (15.0s) getpid native: 640 ns 31.651075934 - 16.060637323 = 15590438611 (15.6s) getpid RET_ALLOW 1 filter: 663 ns 47.367316169 - 31.652302661 = 15715013508 (15.7s) getpid RET_ALLOW 2 filters: 669 ns Estimated total seccomp overhead for 1 filter: 23 ns Estimated total seccomp overhead for 2 filters: 29 ns Estimated seccomp per-filter overhead: 6 ns Estimated seccomp entry overhead: 17 ns Co-developed-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu> Signed-off-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu> Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> --- include/linux/seccomp.h | 22 ++++++++++++ kernel/seccomp.c | 77 +++++++++++++++++++++++++++++++++++++++-- 2 files changed, 97 insertions(+), 2 deletions(-) diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h index 02aef2844c38..08ec8b90c99d 100644 --- a/include/linux/seccomp.h +++ b/include/linux/seccomp.h @@ -21,6 +21,27 @@ #include <asm/seccomp.h> struct seccomp_filter; + +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY +/** + * struct seccomp_cache_task_data - container for seccomp cache's per-task data + * + * @syscall_ok: A bitmap where each bit represents whether the syscall is cached + * and that the filter allowed it. + * @last_filter: If the next cache lookup uses a different filter, the lookup + * will clear cache. + * @last_arch: If the next cache lookup uses a different arch number, the + * lookup will clear cache. 
+ */ +struct seccomp_cache_task_data { + DECLARE_BITMAP(syscall_ok, NR_syscalls); + const struct seccomp_filter *last_filter; + u32 last_arch; +}; +#else +struct seccomp_cache_task_data { }; +#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */ + /** * struct seccomp - the state of a seccomp'ed process * @@ -36,6 +57,7 @@ struct seccomp { int mode; atomic_t filter_count; struct seccomp_filter *filter; + struct seccomp_cache_task_data cache; }; #ifdef CONFIG_HAVE_ARCH_SECCOMP_FILTER diff --git a/kernel/seccomp.c b/kernel/seccomp.c index d8c30901face..7096f8c86f71 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -162,6 +162,17 @@ static inline int seccomp_cache_prepare(struct seccomp_filter *sfilter) { return 0; } + +static inline bool seccomp_cache_check(const struct seccomp_filter *sfilter, + const struct seccomp_data *sd) +{ + return 0; +} + +static inline void seccomp_cache_insert(const struct seccomp_filter *sfilter, + const struct seccomp_data *sd) +{ +} #endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */ /** @@ -316,6 +327,59 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen) return 0; } +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY +/** + * seccomp_cache_check - lookup seccomp cache + * @sfilter: The seccomp filter + * @sd: The seccomp data to lookup the cache with + * + * Returns true if the seccomp_data is cached and allowed. 
+ */ +static bool seccomp_cache_check(const struct seccomp_filter *sfilter, + const struct seccomp_data *sd) +{ + struct seccomp_cache_task_data *thread_data; + int syscall_nr = sd->nr; + + if (unlikely(syscall_nr < 0 || syscall_nr >= NR_syscalls)) + return false; + + thread_data = ¤t->seccomp.cache; + if (unlikely(thread_data->last_filter != sfilter || + thread_data->last_arch != sd->arch)) { + thread_data->last_filter = sfilter; + thread_data->last_arch = sd->arch; + + bitmap_zero(thread_data->syscall_ok, NR_syscalls); + return false; + } + + return test_bit(syscall_nr, thread_data->syscall_ok); +} + +/** + * seccomp_cache_insert - insert into seccomp cache + * @sfilter: The seccomp filter + * @sd: The seccomp data to insert into the cache + */ +static void seccomp_cache_insert(const struct seccomp_filter *sfilter, + const struct seccomp_data *sd) +{ + struct seccomp_cache_task_data *thread_data; + int syscall_nr = sd->nr; + + if (unlikely(syscall_nr < 0 || syscall_nr >= NR_syscalls)) + return; + + thread_data = ¤t->seccomp.cache; + + if (!test_bit(syscall_nr, sfilter->cache.syscall_ok)) + return; + + set_bit(syscall_nr, thread_data->syscall_ok); +} +#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */ + /** * seccomp_run_filters - evaluates all seccomp filters against @sd * @sd: optional seccomp data to be passed to filters @@ -331,13 +395,18 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd, { u32 ret = SECCOMP_RET_ALLOW; /* Make sure cross-thread synced filter points somewhere sane. */ - struct seccomp_filter *f = - READ_ONCE(current->seccomp.filter); + struct seccomp_filter *f, *f_head; + + f = READ_ONCE(current->seccomp.filter); + f_head = f; /* Ensure unexpected behavior doesn't result in failing open. */ if (WARN_ON(f == NULL)) return SECCOMP_RET_KILL_PROCESS; + if (seccomp_cache_check(f_head, sd)) + return SECCOMP_RET_ALLOW; + /* * All filters in the list are evaluated and the lowest BPF return * value always takes priority (ignoring the DATA). 
@@ -350,6 +419,10 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd, *match = f; } } + + if (ret == SECCOMP_RET_ALLOW) + seccomp_cache_insert(f_head, sd); + return ret; } #endif /* CONFIG_SECCOMP_FILTER */ -- 2.28.0 ^ permalink raw reply related [flat|nested] 149+ messages in thread
* Re: [RFC PATCH seccomp 2/2] seccomp/cache: Cache filter results that allow syscalls 2020-09-21 5:35 ` [RFC PATCH seccomp 2/2] seccomp/cache: Cache filter results that allow syscalls YiFei Zhu @ 2020-09-21 18:08 ` Jann Horn 2020-09-21 22:50 ` YiFei Zhu 2020-09-25 0:01 ` [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array Kees Cook 1 sibling, 1 reply; 149+ messages in thread From: Jann Horn @ 2020-09-21 18:08 UTC (permalink / raw) To: YiFei Zhu Cc: Linux Containers, YiFei Zhu, bpf, Andrea Arcangeli, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Valentin Rothberg, Andy Lutomirski, Will Drewry, Aleksa Sarai, kernel list On Mon, Sep 21, 2020 at 7:35 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote: [...] > We do this by creating a per-task bitmap of permitted syscalls. > If seccomp filter is invoked we check if it is cached and if so > directly return allow. Else we call into the cBPF filter, and if > the result is an allow then we cache the results. What? Why? We already have code to statically evaluate the filter for all syscall numbers. We should be using the results of that instead of re-running the filter and separately caching the results. > The cache is per-task Please don't. The static results are per-filter, so the bitmask(s) should also be per-filter and immutable. > minimize thread-synchronization issues in > the hot path of cache lookup There should be no need for synchronization because those bitmasks should be immutable. > and to avoid different architecture > numbers sharing the same cache. There should be separate caches for separate architectures, and we should precompute the results for all architectures. (We only have around 2 different architectures max, so it's completely reasonable to precompute and store all that.) 
> To account for one thread changing the filter for another thread of > the same process, the per-task struct also contains a pointer to > the filter the cache is built on. When the cache lookup uses a > different filter then the last lookup, the per-task cache bitmap is > cleared. Unnecessary complexity, we don't need that if we make the bitmasks immutable. ^ permalink raw reply [flat|nested] 149+ messages in thread
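The alternative being argued for here — an immutable, per-filter allow bitmap — can be sketched as follows. All names are hypothetical and NR_SYSCALLS is shrunk for illustration; the point is that the bitmap is written only while the filter is being attached, so the per-syscall fast path is a read of immutable memory with no clearing and no synchronization.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_SYSCALLS 64	/* illustrative; the real value is per-arch */

/*
 * One allow-bitmap per (filter, architecture), filled in once at
 * attach time by statically evaluating the filter, then treated as
 * read-only forever after.
 */
struct filter_cache {
	uint64_t allow[NR_SYSCALLS / 64];	/* 1 bit per syscall */
};

/* Attach-time path: mark a syscall as argument-independent ALLOW. */
void cache_mark_allowed(struct filter_cache *c, int nr)
{
	c->allow[nr / 64] |= 1ull << (nr % 64);
}

/* Fast path: out-of-range numbers simply fall back to the filter. */
bool cache_allows(const struct filter_cache *c, int nr)
{
	if (nr < 0 || nr >= NR_SYSCALLS)
		return false;
	return (c->allow[nr / 64] >> (nr % 64)) & 1;
}
```

Because nothing ever writes the bitmap after attach, there is no last-filter pointer to compare, no per-task state to invalidate, and nothing for another thread to race against.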
* Re: [RFC PATCH seccomp 2/2] seccomp/cache: Cache filter results that allow syscalls 2020-09-21 18:08 ` Jann Horn @ 2020-09-21 22:50 ` YiFei Zhu 2020-09-21 22:57 ` Jann Horn 0 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-09-21 22:50 UTC (permalink / raw) To: Jann Horn Cc: Linux Containers, YiFei Zhu, bpf, Andrea Arcangeli, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Valentin Rothberg, Andy Lutomirski, Will Drewry, Aleksa Sarai, kernel list On Mon, Sep 21, 2020 at 1:09 PM Jann Horn <jannh@google.com> wrote: > > On Mon, Sep 21, 2020 at 7:35 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > [...] > > We do this by creating a per-task bitmap of permitted syscalls. > > If seccomp filter is invoked we check if it is cached and if so > > directly return allow. Else we call into the cBPF filter, and if > > the result is an allow then we cache the results. > > What? Why? We already have code to statically evaluate the filter for > all syscall numbers. We should be using the results of that instead of > re-running the filter and separately caching the results. > > > The cache is per-task > > Please don't. The static results are per-filter, so the bitmask(s) > should also be per-filter and immutable. I do agree that an immutable bitmask is faster and easier to reason about its correctness. However, I did not find the "code to statically evaluate the filter for all syscall numbers" while reading seccomp.c. Would you give me a pointer to that and I will see how to best make use of it? YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [RFC PATCH seccomp 2/2] seccomp/cache: Cache filter results that allow syscalls 2020-09-21 22:50 ` YiFei Zhu @ 2020-09-21 22:57 ` Jann Horn 2020-09-21 23:08 ` YiFei Zhu 0 siblings, 1 reply; 149+ messages in thread From: Jann Horn @ 2020-09-21 22:57 UTC (permalink / raw) To: YiFei Zhu Cc: Linux Containers, YiFei Zhu, bpf, Andrea Arcangeli, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Valentin Rothberg, Andy Lutomirski, Will Drewry, Aleksa Sarai, kernel list On Tue, Sep 22, 2020 at 12:51 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > On Mon, Sep 21, 2020 at 1:09 PM Jann Horn <jannh@google.com> wrote: > > > > On Mon, Sep 21, 2020 at 7:35 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > > [...] > > > We do this by creating a per-task bitmap of permitted syscalls. > > > If seccomp filter is invoked we check if it is cached and if so > > > directly return allow. Else we call into the cBPF filter, and if > > > the result is an allow then we cache the results. > > > > What? Why? We already have code to statically evaluate the filter for > > all syscall numbers. We should be using the results of that instead of > > re-running the filter and separately caching the results. > > > > > The cache is per-task > > > > Please don't. The static results are per-filter, so the bitmask(s) > > should also be per-filter and immutable. > > I do agree that an immutable bitmask is faster and easier to reason > about its correctness. However, I did not find the "code to statically > evaluate the filter for all syscall numbers" while reading seccomp.c. > Would you give me a pointer to that and I will see how to best make > use of it? I'm talking about the code you're adding in the other patch ("[RFC PATCH seccomp 1/2] seccomp/cache: Add "emulator" to check if filter is arg-dependent"). Sorry, that was a bit unclear. ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [RFC PATCH seccomp 2/2] seccomp/cache: Cache filter results that allow syscalls 2020-09-21 22:57 ` Jann Horn @ 2020-09-21 23:08 ` YiFei Zhu 0 siblings, 0 replies; 149+ messages in thread From: YiFei Zhu @ 2020-09-21 23:08 UTC (permalink / raw) To: Jann Horn Cc: Linux Containers, YiFei Zhu, bpf, Andrea Arcangeli, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Valentin Rothberg, Andy Lutomirski, Will Drewry, Aleksa Sarai, kernel list On Mon, Sep 21, 2020 at 5:58 PM Jann Horn <jannh@google.com> wrote: > > I do agree that an immutable bitmask is faster and easier to reason > > about its correctness. However, I did not find the "code to statically > > evaluate the filter for all syscall numbers" while reading seccomp.c. > > Would you give me a pointer to that and I will see how to best make > > use of it? > > I'm talking about the code you're adding in the other patch ("[RFC > PATCH seccomp 1/2] seccomp/cache: Add "emulator" to check if filter is > arg-dependent"). Sorry, that was a bit unclear. I see, building an immutable accept bitmask when preparing and then just use that when running it. I guess if the arch number issue is resolved this should be more doable. Will do. YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array 2020-09-21 5:35 ` [RFC PATCH seccomp 2/2] seccomp/cache: Cache filter results that allow syscalls YiFei Zhu 2020-09-21 18:08 ` Jann Horn @ 2020-09-25 0:01 ` Kees Cook 2020-09-25 0:15 ` Jann Horn 2020-09-25 1:27 ` YiFei Zhu 1 sibling, 2 replies; 149+ messages in thread From: Kees Cook @ 2020-09-25 0:01 UTC (permalink / raw) To: YiFei Zhu Cc: YiFei Zhu, containers, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry [resend, argh, I didn't reply-all, sorry for the noise] On Thu, Sep 24, 2020 at 07:44:17AM -0500, YiFei Zhu wrote: > From: YiFei Zhu <yifeifz2@illinois.edu> > > Seccomp cache emulator needs to know all the architecture numbers > that syscall_get_arch() could return for the kernel build in order > to generate a cache for all of them. > > The array is declared in header as static __maybe_unused const > to maximize compiler optimization opportunities such as loop > unrolling. Disregarding the "how" of this, yeah, we'll certainly need something to tell seccomp about the arrangement of syscall tables and how to find them. However, I'd still prefer to do this on a per-arch basis, and include more detail, as I've got in my v1. Something missing from both styles, though, is a consolidation of values, where the AUDIT_ARCH* isn't reused in both the seccomp info and the syscall_get_arch() return. The problems here were two-fold: 1) putting this in syscall.h meant you do not have full NR_syscall* visibility on some architectures (e.g. arm64 plays weird games with header include order). 
2) seccomp needs to handle "multiplexed" tables like x86_x32 (distros haven't removed CONFIG_X86_X32 widely yet, so it is a reality that it must be dealt with), which means seccomp's idea of the arch "number" can't be the same as the AUDIT_ARCH. So, likely a combo of approaches is needed: an array (or more likely, enum), declared in the per-arch seccomp.h file. And I don't see a way to solve #1 cleanly. Regardless, it needs to be split per architecture so that regressions can be bisected/reverted/isolated cleanly. And if we can't actually test it at runtime (or find someone who can) it's not a good idea to make the change. :) > [...] > diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h > index 7cbf733d11af..e13bb2a65b6f 100644 > --- a/arch/x86/include/asm/syscall.h > +++ b/arch/x86/include/asm/syscall.h > @@ -97,6 +97,10 @@ static inline void syscall_set_arguments(struct task_struct *task, > memcpy(®s->bx + i, args, n * sizeof(args[0])); > } > > +static __maybe_unused const int syscall_arches[] = { > + AUDIT_ARCH_I386 > +}; > + > static inline int syscall_get_arch(struct task_struct *task) > { > return AUDIT_ARCH_I386; > @@ -152,6 +156,13 @@ static inline void syscall_set_arguments(struct task_struct *task, > } > } > > +static __maybe_unused const int syscall_arches[] = { > + AUDIT_ARCH_X86_64, > +#ifdef CONFIG_IA32_EMULATION > + AUDIT_ARCH_I386, > +#endif > +}; I'm leaving this section quoted because I'll refer to it in a later patch review... -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array 2020-09-25 0:01 ` [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array Kees Cook @ 2020-09-25 0:15 ` Jann Horn 2020-09-25 0:18 ` Al Viro 2020-09-25 1:27 ` YiFei Zhu 1 sibling, 1 reply; 149+ messages in thread From: Jann Horn @ 2020-09-25 0:15 UTC (permalink / raw) To: Kees Cook Cc: YiFei Zhu, YiFei Zhu, Linux Containers, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Fri, Sep 25, 2020 at 2:01 AM Kees Cook <keescook@chromium.org> wrote: > 2) seccomp needs to handle "multiplexed" tables like x86_x32 (distros > haven't removed CONFIG_X86_X32 widely yet, so it is a reality that > it must be dealt with), which means seccomp's idea of the arch > "number" can't be the same as the AUDIT_ARCH. Sure, distros ship it; but basically nobody uses it, it doesn't have to be fast. As long as we don't *break* it, everything's fine. And if we ignore the existence of X32 in the fastpath, that'll just mean that syscalls with the X32 marker bit always hit the seccomp slowpath (because it'll look like the syscall number is out-of-bounds ) - no problem. ^ permalink raw reply [flat|nested] 149+ messages in thread
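The range-check argument above can be made concrete. __X32_SYSCALL_BIT is the real marker bit from the x86 uapi headers; NR_SYSCALLS is an illustrative stand-in for the generated native table size.

```c
#include <assert.h>
#include <stdbool.h>

#define __X32_SYSCALL_BIT 0x40000000	/* arch/x86/include/uapi/asm/unistd.h */
#define NR_SYSCALLS 450			/* illustrative stand-in */

/*
 * The bounds check guarding any native-table bitmap: an x32 syscall
 * number carries the marker bit, so it is far out of range, always
 * fails this check, and falls through to the ordinary (slow) filter
 * evaluation -- x32 keeps working, it just isn't accelerated.
 */
bool in_bitmap_range(long nr)
{
	return nr >= 0 && nr < NR_SYSCALLS;
}
```

So ignoring x32 in the fast path is safe by construction: the worst case is that an x32 syscall number is never looked up in the cache at all.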
* Re: [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array 2020-09-25 0:15 ` Jann Horn @ 2020-09-25 0:18 ` Al Viro 2020-09-25 0:24 ` Jann Horn 0 siblings, 1 reply; 149+ messages in thread From: Al Viro @ 2020-09-25 0:18 UTC (permalink / raw) To: Jann Horn Cc: Kees Cook, YiFei Zhu, YiFei Zhu, Linux Containers, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Fri, Sep 25, 2020 at 02:15:50AM +0200, Jann Horn wrote: > On Fri, Sep 25, 2020 at 2:01 AM Kees Cook <keescook@chromium.org> wrote: > > 2) seccomp needs to handle "multiplexed" tables like x86_x32 (distros > > haven't removed CONFIG_X86_X32 widely yet, so it is a reality that > > it must be dealt with), which means seccomp's idea of the arch > > "number" can't be the same as the AUDIT_ARCH. > > Sure, distros ship it; but basically nobody uses it, it doesn't have > to be fast. As long as we don't *break* it, everything's fine. And if > we ignore the existence of X32 in the fastpath, that'll just mean that > syscalls with the X32 marker bit always hit the seccomp slowpath > (because it'll look like the syscall number is out-of-bounds ) - no > problem. You do realize that X32 is amd64 counterpart of mips n32, right? And that's not "basically nobody uses it"... ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array 2020-09-25 0:18 ` Al Viro @ 2020-09-25 0:24 ` Jann Horn 0 siblings, 0 replies; 149+ messages in thread From: Jann Horn @ 2020-09-25 0:24 UTC (permalink / raw) To: Al Viro Cc: Kees Cook, YiFei Zhu, YiFei Zhu, Linux Containers, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Fri, Sep 25, 2020 at 2:18 AM Al Viro <viro@zeniv.linux.org.uk> wrote: > On Fri, Sep 25, 2020 at 02:15:50AM +0200, Jann Horn wrote: > > On Fri, Sep 25, 2020 at 2:01 AM Kees Cook <keescook@chromium.org> wrote: > > > 2) seccomp needs to handle "multiplexed" tables like x86_x32 (distros > > > haven't removed CONFIG_X86_X32 widely yet, so it is a reality that > > > it must be dealt with), which means seccomp's idea of the arch > > > "number" can't be the same as the AUDIT_ARCH. > > > > Sure, distros ship it; but basically nobody uses it, it doesn't have > > to be fast. As long as we don't *break* it, everything's fine. And if > > we ignore the existence of X32 in the fastpath, that'll just mean that > > syscalls with the X32 marker bit always hit the seccomp slowpath > > (because it'll look like the syscall number is out-of-bounds ) - no > > problem. > > You do realize that X32 is amd64 counterpart of mips n32, right? And that's > not "basically nobody uses it"... What makes X32 weird for seccomp is that it has the syscall tables for X86-64 and X32 mushed together, using the single architecture identifier AUDIT_ARCH_X86_64. I believe that's what Kees referred to by "multiplexed tables". As far as I can tell, MIPS is more well-behaved there and uses the separate architecture identifiers AUDIT_ARCH_MIPS|__AUDIT_ARCH_64BIT and AUDIT_ARCH_MIPS|__AUDIT_ARCH_64BIT|__AUDIT_ARCH_CONVENTION_MIPS64_N32. 
(But no, I did not actually realize that that's what N32 is. Thanks for the explanation, I was wondering why MIPS was the only architecture with three architecture identifiers...) ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array 2020-09-25 0:01 ` [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array Kees Cook 2020-09-25 0:15 ` Jann Horn @ 2020-09-25 1:27 ` YiFei Zhu 2020-09-25 3:09 ` Kees Cook 1 sibling, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-09-25 1:27 UTC (permalink / raw) To: Kees Cook Cc: YiFei Zhu, Linux Containers, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry [resending this too] On Thu, Sep 24, 2020 at 6:01 PM Kees Cook <keescook@chromium.org> wrote: > Disregarding the "how" of this, yeah, we'll certainly need something to > tell seccomp about the arrangement of syscall tables and how to find > them. > > However, I'd still prefer to do this on a per-arch basis, and include > more detail, as I've got in my v1. > > Something missing from both styles, though, is a consolidation of > values, where the AUDIT_ARCH* isn't reused in both the seccomp info and > the syscall_get_arch() return. The problems here were two-fold: > > 1) putting this in syscall.h meant you do not have full NR_syscall* > visibility on some architectures (e.g. arm64 plays weird games with > header include order). I don't get this one -- I'm not playing with NR_syscall here. > 2) seccomp needs to handle "multiplexed" tables like x86_x32 (distros > haven't removed CONFIG_X86_X32 widely yet, so it is a reality that > it must be dealt with), which means seccomp's idea of the arch > "number" can't be the same as the AUDIT_ARCH. Why so? Does anyone actually use x32 in a container? The memory cost and analysis cost is on everyone. The worst case scenario if we don't support it is that the syscall is not accelerated. 
> So, likely a combo of approaches is needed: an array (or more likely, > enum), declared in the per-arch seccomp.h file. And I don't see a way > to solve #1 cleanly. > > Regardless, it needs to be split per architecture so that regressions > can be bisected/reverted/isolated cleanly. And if we can't actually test > it at runtime (or find someone who can) it's not a good idea to make the > change. :) You have a good point regarding tests. Don't see how it affects regressions though. Only one file here is ever included per-build. > > [...] > > diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h > > index 7cbf733d11af..e13bb2a65b6f 100644 > > --- a/arch/x86/include/asm/syscall.h > > +++ b/arch/x86/include/asm/syscall.h > > @@ -97,6 +97,10 @@ static inline void syscall_set_arguments(struct task_struct *task, > > memcpy(®s->bx + i, args, n * sizeof(args[0])); > > } > > > > +static __maybe_unused const int syscall_arches[] = { > > + AUDIT_ARCH_I386 > > +}; > > + > > static inline int syscall_get_arch(struct task_struct *task) > > { > > return AUDIT_ARCH_I386; > > @@ -152,6 +156,13 @@ static inline void syscall_set_arguments(struct task_struct *task, > > } > > } > > > > +static __maybe_unused const int syscall_arches[] = { > > + AUDIT_ARCH_X86_64, > > +#ifdef CONFIG_IA32_EMULATION > > + AUDIT_ARCH_I386, > > +#endif > > +}; > > I'm leaving this section quoted because I'll refer to it in a later > patch review... > > -- > Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
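A rough userspace sketch of how the patch's proposed syscall_arches[] array might be consumed (the AUDIT_ARCH_* constants match uapi/linux/audit.h, but the lookup helper, its name, and the use of uint32_t are invented for illustration and are not part of the patch):

```c
#include <stdint.h>
#include <stddef.h>

/* Mock of the proposed per-arch array; the real definitions live in
 * each architecture's asm/syscall.h.  CONFIG_IA32_EMULATION is assumed
 * enabled here, so both arches are listed. */
#define AUDIT_ARCH_X86_64 0xc000003eu
#define AUDIT_ARCH_I386   0x40000003u

static const uint32_t syscall_arches[] = {
	AUDIT_ARCH_X86_64,
	AUDIT_ARCH_I386,
};

/* Hypothetical helper: map an AUDIT_ARCH_* value to its index in the
 * array (e.g. to select a per-arch bitmap); -1 for unknown arches. */
static int seccomp_arch_index(uint32_t arch)
{
	for (size_t i = 0; i < sizeof(syscall_arches) / sizeof(syscall_arches[0]); i++)
		if (syscall_arches[i] == arch)
			return (int)i;
	return -1;
}
```

Unknown arches (for example x32, if it is left unsupported) simply fall through to -1 and stay unaccelerated, which matches the worst-case behavior described above.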
* Re: [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array 2020-09-25 1:27 ` YiFei Zhu @ 2020-09-25 3:09 ` Kees Cook 2020-09-25 3:28 ` YiFei Zhu 0 siblings, 1 reply; 149+ messages in thread From: Kees Cook @ 2020-09-25 3:09 UTC (permalink / raw) To: YiFei Zhu Cc: YiFei Zhu, Linux Containers, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Thu, Sep 24, 2020 at 08:27:40PM -0500, YiFei Zhu wrote: > [resending this too] > > On Thu, Sep 24, 2020 at 6:01 PM Kees Cook <keescook@chromium.org> wrote: > > Disregarding the "how" of this, yeah, we'll certainly need something to > > tell seccomp about the arrangement of syscall tables and how to find > > them. > > > > However, I'd still prefer to do this on a per-arch basis, and include > > more detail, as I've got in my v1. > > > > Something missing from both styles, though, is a consolidation of > > values, where the AUDIT_ARCH* isn't reused in both the seccomp info and > > the syscall_get_arch() return. The problems here were two-fold: > > > > 1) putting this in syscall.h meant you do not have full NR_syscall* > > visibility on some architectures (e.g. arm64 plays weird games with > > header include order). > > I don't get this one -- I'm not playing with NR_syscall here. Right, sorry, I may not have been clear. When building my RFC I noticed that I couldn't use NR_syscall very "early" in the header file include stack on arm64, which complicated things. So I guess what I mean is something like "it's probably better to do all these seccomp-specific macros/etc in asm/include/seccomp.h rather than in syscall.h because I know at least one architecture that might cause trouble." 
> > 2) seccomp needs to handle "multiplexed" tables like x86_x32 (distros > > haven't removed CONFIG_X86_X32 widely yet, so it is a reality that > > it must be dealt with), which means seccomp's idea of the arch > > "number" can't be the same as the AUDIT_ARCH. > > Why so? Does anyone actually use x32 in a container? The memory cost > and analysis cost is on everyone. The worst case scenario if we don't > support it is that the syscall is not accelerated. Ironically, that's the only place I actually know for sure where people are using x32, because it shows measurable (10%) speed-up for builders: https://lore.kernel.org/lkml/CAOesGMgu1i3p7XMZuCEtj63T-ST_jh+BfaHy-K6LhgqNriKHAA@mail.gmail.com So, yes, as you and Jann both point out, it wouldn't be terrible to just ignore x32, but it seems a shame to penalize it. That said, if the masking step from my v1 is actually noticeable on a native workload, then yeah, probably x32 should be ignored. My instinct (not measured) is that it's faster than walking a small array.[citation needed] > > So, likely a combo of approaches is needed: an array (or more likely, > enum), declared in the per-arch seccomp.h file. And I don't see a way > to solve #1 cleanly. > > > > Regardless, it needs to be split per architecture so that regressions > > can be bisected/reverted/isolated cleanly. And if we can't actually test > > it at runtime (or find someone who can) it's not a good idea to make the > > change. :) > > You have a good point regarding tests. Don't see how it affects > regressions though. Only one file here is ever included per-build. It's easier to do a per-arch revert (i.e. all the -stable tree machinery, etc) with a single SHA instead of having to write a partial revert, etc. -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array 2020-09-25 3:09 ` Kees Cook @ 2020-09-25 3:28 ` YiFei Zhu 2020-09-25 16:39 ` YiFei Zhu 0 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-09-25 3:28 UTC (permalink / raw) To: Kees Cook Cc: YiFei Zhu, Linux Containers, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Thu, Sep 24, 2020 at 10:09 PM Kees Cook <keescook@chromium.org> wrote: > Right, sorry, I may not have been clear. When building my RFC I noticed > that I couldn't use NR_syscall very "early" in the header file include > stack on arm64, which complicated things. So I guess what I mean is > something like "it's probably better to do all these seccomp-specific > macros/etc in asm/include/seccomp.h rather than in syscall.h because I > know at least one architecture that might cause trouble." Ah. Makes sense. > Ironicailly, that's the only place I actually know for sure where people > using x32 because it shows measurable (10%) speed-up for builders: > https://lore.kernel.org/lkml/CAOesGMgu1i3p7XMZuCEtj63T-ST_jh+BfaHy-K6LhgqNriKHAA@mail.gmail.com Wow. 10% is significant. Makes you wonder why x32 hasn't conquered the world. > So, yes, as you and Jann both point out, it wouldn't be terrible to just > ignore x32, it seems a shame to penalize it. That said, if the masking > step from my v1 is actually noticable on a native workload, then yeah, > probably x32 should be ignored. My instinct (not measured) is that it's > faster than walking a small array.[citation needed] My instinct: should be pretty similar, with the loop unrolled. You've convinced me that penalizing x32 by not supporting it would be a pity :( The 10% is so nice I want it. > It's easier to do a per-arch revert (i.e.
all the -stable tree > machinery, etc) with a single SHA instead of having to write a partial > revert, etc. I see. Thanks for clarifying. How about this? Rather than specifically designing names for bitmasks (native, compat, multiplex), just have SECCOMP_ARCH_{1,2,3}? Each arch number would provide the size of the bitmap and a static inline function to check whether the given seccomp_data belongs to that arch and, if so, the index of its bit in the bitmap. There is no need for the shifts and madness in seccomp.c; it's arch-dependent code in each arch's own seccomp.h. We let the preprocessor and compiler optimize things. YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
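The SECCOMP_ARCH_{1,2,3} idea above can be sketched as plain userspace C (arch numbers are the real AUDIT_ARCH_* values, but the syscall counts and the function name are invented for illustration): each arch contributes a table size and an inline mapper, and the generic code only ever deals in flat bit indices.

```c
#include <stdint.h>

#define AUDIT_ARCH_X86_64 0xc000003eu
#define AUDIT_ARCH_I386   0x40000003u

/* Per-arch bitmap sizes; a real arch would set these to its syscall
 * table sizes.  SECCOMP_ARCH_1 is native, SECCOMP_ARCH_2 is compat.
 * The counts below are illustrative, not the actual table sizes. */
#define SECCOMP_ARCH_1_NR  450
#define SECCOMP_ARCH_2_NR  440
#define SECCOMP_CACHE_BITS (SECCOMP_ARCH_1_NR + SECCOMP_ARCH_2_NR)

/* Would live in the arch's seccomp.h: map (arch, nr) to a bit index in
 * one flat bitmap, or -1 if this (arch, nr) is not cached at all
 * (out of range, or a deliberately unsupported arch such as x32). */
static inline int seccomp_cache_bit(uint32_t arch, int nr)
{
	switch (arch) {
	case AUDIT_ARCH_X86_64:
		return (nr >= 0 && nr < SECCOMP_ARCH_1_NR) ? nr : -1;
	case AUDIT_ARCH_I386:
		return (nr >= 0 && nr < SECCOMP_ARCH_2_NR)
			? SECCOMP_ARCH_1_NR + nr : -1;
	default:
		return -1;
	}
}
```

Because the switch is over compile-time constants, the compiler can fold this down to a couple of compares, which is the "let the preprocessor and compiler optimize things" point made above.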
* Re: [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array 2020-09-25 3:28 ` YiFei Zhu @ 2020-09-25 16:39 ` YiFei Zhu 0 siblings, 0 replies; 149+ messages in thread From: YiFei Zhu @ 2020-09-25 16:39 UTC (permalink / raw) To: Kees Cook Cc: YiFei Zhu, Linux Containers, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Thu, Sep 24, 2020 at 10:28 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > Ah. Makes sense. > > > Ironicailly, that's the only place I actually know for sure where people > > using x32 because it shows measurable (10%) speed-up for builders: > > https://lore.kernel.org/lkml/CAOesGMgu1i3p7XMZuCEtj63T-ST_jh+BfaHy-K6LhgqNriKHAA@mail.gmail.com > > Wow. 10% is significant. Makes you wonder why x32 hasn't conquered the world. > > > So, yes, as you and Jann both point out, it wouldn't be terrible to just > > ignore x32, it seems a shame to penalize it. That said, if the masking > > step from my v1 is actually noticable on a native workload, then yeah, > > probably x32 should be ignored. My instinct (not measured) is that it's > > faster than walking a small array.[citation needed] > > You convince me that penalizing supporting x32 would be a pity :( The > 10% is so nice I want it. I'm rethinking this -- the majority of our users will not use x32. I don't think it's that useful for the majority to run all the simulations and have the memory footprint if only a small minority will use it. I also just checked Debian, and it has boot-time disabling of the x32 arch downstream [1]: CONFIG_X86_X32=y CONFIG_X86_X32_DISABLED=y Which means we will still generate all the code for x32 in seccomp even though people probably won't be using it... 
I also talked to some of my peers, and they had a point about how x32's 4GiB address-space limit is very harsh on many modern language runtimes, so even though it provides a 10% speed boost, its adoption is hard -- one has to compile all the C libraries for x32 in addition to x86_64, since any program needing more than 4GiB of address space still requires the x86_64 versions of the libraries. [1] https://wiki.debian.org/X32Port YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls 2020-09-21 5:35 [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls YiFei Zhu 2020-09-21 5:35 ` [RFC PATCH seccomp 1/2] seccomp/cache: Add "emulator" to check if filter is arg-dependent YiFei Zhu 2020-09-21 5:35 ` [RFC PATCH seccomp 2/2] seccomp/cache: Cache filter results that allow syscalls YiFei Zhu @ 2020-09-21 5:48 ` Sargun Dhillon 2020-09-21 7:13 ` YiFei Zhu 2020-09-21 8:30 ` Christian Brauner ` (5 subsequent siblings) 8 siblings, 1 reply; 149+ messages in thread From: Sargun Dhillon @ 2020-09-21 5:48 UTC (permalink / raw) To: YiFei Zhu Cc: Linux Containers, Andrea Arcangeli, Giuseppe Scrivano, Kees Cook, YiFei Zhu, Tobin Feldman-Fitzthum, Dimitrios Skarlatos, Valentin Rothberg, Hubertus Franke, Jack Chen, Josep Torrellas, bpf, Tianyin Xu On Sun, Sep 20, 2020 at 10:35 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > > From: YiFei Zhu <yifeifz2@illinois.edu> > > This series adds a bitmap to cache seccomp filter results if the > result permits a syscall and is indepenent of syscall arguments. > This visibly decreases seccomp overhead for most common seccomp > filters with very little memory footprint. > > The overhead of running Seccomp filters has been part of some past > discussions [1][2][3]. Oftentimes, the filters have a large number > of instructions that check syscall numbers one by one and jump based > on that. Some users chain BPF filters which further enlarge the > overhead. A recent work [6] comprehensively measures the Seccomp > overhead and shows that the overhead is non-negligible and has a > non-trivial impact on application performance. > > We propose SECCOMP_CACHE, a cache-based solution to minimize the > Seccomp overhead. The basic idea is to cache the result of each > syscall check to save the subsequent overhead of executing the > filters. 
This is feasible, because the check in Seccomp is stateless. > The checking results of the same syscall ID and argument remains > the same. > > We observed some common filters, such as docker's [4] or > systemd's [5], will make most decisions based only on the syscall > numbers, and as past discussions considered, a bitmap where each bit > represents a syscall makes most sense for these filters. > > In the past Kees proposed [2] to have an "add this syscall to the > reject bitmask". It is indeed much easier to securely make a reject > accelerator to pre-filter syscalls before passing to the BPF > filters, considering it could only strengthen the security provided > by the filter. However, ultimately, filter rejections are an > exceptional / rare case. Here, instead of accelerating what is > rejected, we accelerate what is allowed. In order not to compromise > the security rules the BPF filters defined, any accept-side > accelerator must complement the BPF filters rather than replacing them. > > Statically analyzing BPF bytecode to see if each syscall is going to > always land in allow or reject is more of a rabbit hole, especially > there is no current in-kernel infrastructure to enumerate all the > possible architecture numbers for a given machine. So rather than > doing that, we propose to cache the results after the BPF filters are > run. And since there are filters like docker's who will check > arguments of some syscalls, but not all or none of the syscalls, when > a filter is loaded we analyze it to find whether each syscall is > cacheable (does not access syscall argument or instruction pointer) by > following its control flow graph, and store the result for each filter > in a bitmap. Changes to architecture number or the filter are expected > to be rare and simply cause the cache to be cleared. This solution > shall be fully transparent to userspace. Long-term, do you believe static analysis will be viable? 
I think that it is the "ideal" solution here, but I agree in that it is more complex. Is there a way to "prime" filters, by giving them a syscall #, and if it has a terminal condition without inspecting args, it turns into a bitmask entry viable? > > Ongoing work is to further support arguments with fast hash table > lookups. We are investigating the performance of doing so [6], and how > to best integrate with the existing seccomp infrastructure. > > We have done some benchmarks with patch applied against bpf-next > commit 2e80be60c465 ("libbpf: Fix compilation warnings for 64-bit printf args"). > > Me, in qemu-kvm x86_64 VM, on Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz, > average results: > > Without cache, seccomp_benchmark: > Current BPF sysctl settings: > net.core.bpf_jit_enable = 1 > net.core.bpf_jit_harden = 0 > Calibrating sample size for 15 seconds worth of syscalls ... > Benchmarking 23486415 syscalls... > 16.079642020 - 1.013345439 = 15066296581 (15.1s) > getpid native: 641 ns > 32.080237410 - 16.080763500 = 15999473910 (16.0s) > getpid RET_ALLOW 1 filter: 681 ns > 48.609461618 - 32.081296173 = 16528165445 (16.5s) > getpid RET_ALLOW 2 filters: 703 ns > Estimated total seccomp overhead for 1 filter: 40 ns > Estimated total seccomp overhead for 2 filters: 62 ns > Estimated seccomp per-filter overhead: 22 ns > Estimated seccomp entry overhead: 18 ns > > With cache: > Current BPF sysctl settings: > net.core.bpf_jit_enable = 1 > net.core.bpf_jit_harden = 0 > Calibrating sample size for 15 seconds worth of syscalls ... > Benchmarking 23486415 syscalls... 
> 16.059512499 - 1.014108434 = 15045404065 (15.0s) > getpid native: 640 ns > 31.651075934 - 16.060637323 = 15590438611 (15.6s) > getpid RET_ALLOW 1 filter: 663 ns > 47.367316169 - 31.652302661 = 15715013508 (15.7s) > getpid RET_ALLOW 2 filters: 669 ns > Estimated total seccomp overhead for 1 filter: 23 ns > Estimated total seccomp overhead for 2 filters: 29 ns > Estimated seccomp per-filter overhead: 6 ns > Estimated seccomp entry overhead: 17 ns > > Depending on the run estimated seccomp overhead for 2 filters can be > less than seccomp overhead for 1 filter, resulting in underflow to > estimated seccomp per-filter overhead: > Estimated total seccomp overhead for 1 filter: 27 ns > Estimated total seccomp overhead for 2 filters: 21 ns > Estimated seccomp per-filter overhead: 18446744073709551610 ns > Estimated seccomp entry overhead: 33 ns > > Jack Chen has also run some benchmarks on a bare metal > Intel(R) Xeon(R) CPU E3-1240 v3 @ 3.40GHz, with side channel > mitigations off (spec_store_bypass_disable=off spectre_v2=off mds=off > pti=off l1tf=off), with BPF JIT on and docker default profile, > and reported: > > unixbench syscall mix (https://github.com/kdlucas/byte-unixbench) > unconfined: 33295685 > docker default: 20661056 60% > docker default + cache: 25719937 30% > > Patch 1 introduces the static analyzer to check for a given filter, > whether the CFG loads the syscall arguments for each syscall number. > > Patch 2 implements the bitmap cache. 
> > [1] https://lore.kernel.org/linux-security-module/c22a6c3cefc2412cad00ae14c1371711@huawei.com/T/ > [2] https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/T/ > [3] https://github.com/seccomp/libseccomp/issues/116 > [4] https://github.com/moby/moby/blob/ae0ef82b90356ac613f329a8ef5ee42ca923417d/profiles/seccomp/default.json > [5] https://github.com/systemd/systemd/blob/6743a1caf4037f03dc51a1277855018e4ab61957/src/shared/seccomp-util.c#L270 > [6] Draco: Architectural and Operating System Support for System Call Security > https://tianyin.github.io/pub/draco.pdf, MICRO-53, Oct. 2020 > > YiFei Zhu (2): > seccomp/cache: Add "emulator" to check if filter is arg-dependent > seccomp/cache: Cache filter results that allow syscalls > > arch/x86/Kconfig | 27 +++ > include/linux/seccomp.h | 22 +++ > kernel/seccomp.c | 400 +++++++++++++++++++++++++++++++++++++++- > 3 files changed, 446 insertions(+), 3 deletions(-) > > -- > 2.28.0 > _______________________________________________ > Containers mailing list > Containers@lists.linux-foundation.org > https://lists.linuxfoundation.org/mailman/listinfo/containers ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls 2020-09-21 5:48 ` [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls Sargun Dhillon @ 2020-09-21 7:13 ` YiFei Zhu 0 siblings, 0 replies; 149+ messages in thread From: YiFei Zhu @ 2020-09-21 7:13 UTC (permalink / raw) To: Sargun Dhillon Cc: Linux Containers, Andrea Arcangeli, Giuseppe Scrivano, Kees Cook, YiFei Zhu, Tobin Feldman-Fitzthum, Dimitrios Skarlatos, Valentin Rothberg, Hubertus Franke, Jack Chen, Josep Torrellas, bpf, Tianyin Xu On Mon, Sep 21, 2020 at 12:49 AM Sargun Dhillon <sargun@sargun.me> wrote: > > On Sun, Sep 20, 2020 at 10:35 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > > > Long-term, do you believe static analysis will be viable? I think that it is > the "ideal" solution here, but I agree in that it is more complex. > > Is there a way to "prime" filters, by giving them a syscall #, and if it has > a terminal condition without inspecting args, it turns into a bitmask entry > viable? I think in theory one could follow the execution of the filter, and if the filter is determined to return a pass for a given syscall number under all circumstances, we record that syscall. We can then replace the bitmap_zero call in seccomp_cache_check with a call to bitmap_copy from the pre-primed bitmap. However, I don't know how much benefit this would provide. One ugly part of the current situation is that the kernel has absolutely no idea what arch numbers returned by syscall_get_arch may be possible for the machine it is running on. For example, for an x86_64 machine with IA32 emulation, the arch number can be either AUDIT_ARCH_I386 or AUDIT_ARCH_X86_64. The seccomp filter will typically have parts handling both cases. As a result, an uncertainty for one syscall on one arch will affect the syscall under the same number for the other arch. 
If a syscall number is not guaranteed to be allowed under both arches, it won't be primed. Given that a seccomp filter is usually a list of allowed syscalls, my guess is that there won't be many syscall numbers that fall under this case; though, I have not tested this. We could add an array of possible arch numbers so that the emulator can refine its tracing. This is probably the best-effort approach, though seccomp_cache_prepare would then have to iterate through all combinations of syscall numbers and arch numbers. Given that seccomp_cache_prepare should be relatively cold, that's probably not too much trouble. Alternatively, we could employ constraint tracking, but that sounds overly complex for what we are trying to do. The other question would be: is pre-priming the cache worth the effort? The assumption is that the vast majority of cacheable syscalls will be permitted. For them, only the first invocation of a particular syscall experiences the overhead of running the filter, so the part of the initial run that pre-priming would optimize away is relatively cold anyway. wdyt? YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
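The prepare-at-load / check-at-entry split discussed in this subthread can be sketched in miniature (userspace C; the emulator is stubbed out, and all names, sizes, and the "arg-dependent" syscalls are illustrative, not the real kernel code):

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NR_SYSCALLS 64	/* small illustrative table */

struct filter_cache {
	uint64_t allow;	/* bit n set => syscall n always allowed, arg-independent */
};

/* Placeholder for the in-kernel emulator: returns true when the
 * filter provably allows syscall `nr` without reading its arguments.
 * Here we just pretend syscalls 2 and 3 are arg-dependent. */
static bool filter_always_allows(int nr)
{
	return nr != 2 && nr != 3;
}

/* Runs once at filter-load time (the "relatively cold" path). */
static void cache_prepare(struct filter_cache *c)
{
	memset(c, 0, sizeof(*c));
	for (int nr = 0; nr < NR_SYSCALLS; nr++)
		if (filter_always_allows(nr))
			c->allow |= 1ULL << nr;
}

/* Hot path: one bit test instead of executing the BPF program.
 * A miss means "fall back to running the filter", not "deny". */
static bool cache_check(const struct filter_cache *c, int nr)
{
	return nr >= 0 && nr < NR_SYSCALLS && ((c->allow >> nr) & 1);
}
```

This also shows why a cache miss is safe: the bitmap can only short-circuit to "allow"; anything not provably cacheable still goes through the full filter.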
* Re: [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls 2020-09-21 5:35 [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls YiFei Zhu ` (2 preceding siblings ...) 2020-09-21 5:48 ` [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls Sargun Dhillon @ 2020-09-21 8:30 ` Christian Brauner 2020-09-21 8:44 ` YiFei Zhu 2020-09-21 13:51 ` Tycho Andersen ` (4 subsequent siblings) 8 siblings, 1 reply; 149+ messages in thread From: Christian Brauner @ 2020-09-21 8:30 UTC (permalink / raw) To: YiFei Zhu Cc: containers, YiFei Zhu, bpf, Andrea Arcangeli, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Valentin Rothberg, Andy Lutomirski, Will Drewry, Jann Horn, Aleksa Sarai, linux-kernel On Mon, Sep 21, 2020 at 12:35:16AM -0500, YiFei Zhu wrote: > From: YiFei Zhu <yifeifz2@illinois.edu> > > This series adds a bitmap to cache seccomp filter results if the > result permits a syscall and is indepenent of syscall arguments. > This visibly decreases seccomp overhead for most common seccomp > filters with very little memory footprint. This is missing some people so expanding the Cc a little. Make sure to run scripts/get_maintainers.pl next time, in case you forgot. (Adding Andy, Will, Jann, Aleksa at least.) Christian > > The overhead of running Seccomp filters has been part of some past > discussions [1][2][3]. Oftentimes, the filters have a large number > of instructions that check syscall numbers one by one and jump based > on that. Some users chain BPF filters which further enlarge the > overhead. A recent work [6] comprehensively measures the Seccomp > overhead and shows that the overhead is non-negligible and has a > non-trivial impact on application performance. 
> > We propose SECCOMP_CACHE, a cache-based solution to minimize the > Seccomp overhead. The basic idea is to cache the result of each > syscall check to save the subsequent overhead of executing the > filters. This is feasible, because the check in Seccomp is stateless. > The checking results of the same syscall ID and argument remains > the same. > > We observed some common filters, such as docker's [4] or > systemd's [5], will make most decisions based only on the syscall > numbers, and as past discussions considered, a bitmap where each bit > represents a syscall makes most sense for these filters. > > In the past Kees proposed [2] to have an "add this syscall to the > reject bitmask". It is indeed much easier to securely make a reject > accelerator to pre-filter syscalls before passing to the BPF > filters, considering it could only strengthen the security provided > by the filter. However, ultimately, filter rejections are an > exceptional / rare case. Here, instead of accelerating what is > rejected, we accelerate what is allowed. In order not to compromise > the security rules the BPF filters defined, any accept-side > accelerator must complement the BPF filters rather than replacing them. > > Statically analyzing BPF bytecode to see if each syscall is going to > always land in allow or reject is more of a rabbit hole, especially > there is no current in-kernel infrastructure to enumerate all the > possible architecture numbers for a given machine. So rather than > doing that, we propose to cache the results after the BPF filters are > run. And since there are filters like docker's who will check > arguments of some syscalls, but not all or none of the syscalls, when > a filter is loaded we analyze it to find whether each syscall is > cacheable (does not access syscall argument or instruction pointer) by > following its control flow graph, and store the result for each filter > in a bitmap. 
Changes to architecture number or the filter are expected > to be rare and simply cause the cache to be cleared. This solution > shall be fully transparent to userspace. > > Ongoing work is to further support arguments with fast hash table > lookups. We are investigating the performance of doing so [6], and how > to best integrate with the existing seccomp infrastructure. > > We have done some benchmarks with patch applied against bpf-next > commit 2e80be60c465 ("libbpf: Fix compilation warnings for 64-bit printf args"). > > Me, in qemu-kvm x86_64 VM, on Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz, > average results: > > Without cache, seccomp_benchmark: > Current BPF sysctl settings: > net.core.bpf_jit_enable = 1 > net.core.bpf_jit_harden = 0 > Calibrating sample size for 15 seconds worth of syscalls ... > Benchmarking 23486415 syscalls... > 16.079642020 - 1.013345439 = 15066296581 (15.1s) > getpid native: 641 ns > 32.080237410 - 16.080763500 = 15999473910 (16.0s) > getpid RET_ALLOW 1 filter: 681 ns > 48.609461618 - 32.081296173 = 16528165445 (16.5s) > getpid RET_ALLOW 2 filters: 703 ns > Estimated total seccomp overhead for 1 filter: 40 ns > Estimated total seccomp overhead for 2 filters: 62 ns > Estimated seccomp per-filter overhead: 22 ns > Estimated seccomp entry overhead: 18 ns > > With cache: > Current BPF sysctl settings: > net.core.bpf_jit_enable = 1 > net.core.bpf_jit_harden = 0 > Calibrating sample size for 15 seconds worth of syscalls ... > Benchmarking 23486415 syscalls... 
> 16.059512499 - 1.014108434 = 15045404065 (15.0s) > getpid native: 640 ns > 31.651075934 - 16.060637323 = 15590438611 (15.6s) > getpid RET_ALLOW 1 filter: 663 ns > 47.367316169 - 31.652302661 = 15715013508 (15.7s) > getpid RET_ALLOW 2 filters: 669 ns > Estimated total seccomp overhead for 1 filter: 23 ns > Estimated total seccomp overhead for 2 filters: 29 ns > Estimated seccomp per-filter overhead: 6 ns > Estimated seccomp entry overhead: 17 ns > > Depending on the run estimated seccomp overhead for 2 filters can be > less than seccomp overhead for 1 filter, resulting in underflow to > estimated seccomp per-filter overhead: > Estimated total seccomp overhead for 1 filter: 27 ns > Estimated total seccomp overhead for 2 filters: 21 ns > Estimated seccomp per-filter overhead: 18446744073709551610 ns > Estimated seccomp entry overhead: 33 ns > > Jack Chen has also run some benchmarks on a bare metal > Intel(R) Xeon(R) CPU E3-1240 v3 @ 3.40GHz, with side channel > mitigations off (spec_store_bypass_disable=off spectre_v2=off mds=off > pti=off l1tf=off), with BPF JIT on and docker default profile, > and reported: > > unixbench syscall mix (https://github.com/kdlucas/byte-unixbench) > unconfined: 33295685 > docker default: 20661056 60% > docker default + cache: 25719937 30% > > Patch 1 introduces the static analyzer to check for a given filter, > whether the CFG loads the syscall arguments for each syscall number. > > Patch 2 implements the bitmap cache. 
> > [1] https://lore.kernel.org/linux-security-module/c22a6c3cefc2412cad00ae14c1371711@huawei.com/T/ > [2] https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/T/ > [3] https://github.com/seccomp/libseccomp/issues/116 > [4] https://github.com/moby/moby/blob/ae0ef82b90356ac613f329a8ef5ee42ca923417d/profiles/seccomp/default.json > [5] https://github.com/systemd/systemd/blob/6743a1caf4037f03dc51a1277855018e4ab61957/src/shared/seccomp-util.c#L270 > [6] Draco: Architectural and Operating System Support for System Call Security > https://tianyin.github.io/pub/draco.pdf, MICRO-53, Oct. 2020 > > YiFei Zhu (2): > seccomp/cache: Add "emulator" to check if filter is arg-dependent > seccomp/cache: Cache filter results that allow syscalls > > arch/x86/Kconfig | 27 +++ > include/linux/seccomp.h | 22 +++ > kernel/seccomp.c | 400 +++++++++++++++++++++++++++++++++++++++- > 3 files changed, 446 insertions(+), 3 deletions(-) > > -- > 2.28.0 ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls 2020-09-21 8:30 ` Christian Brauner @ 2020-09-21 8:44 ` YiFei Zhu 0 siblings, 0 replies; 149+ messages in thread From: YiFei Zhu @ 2020-09-21 8:44 UTC (permalink / raw) To: Christian Brauner Cc: Linux Containers, YiFei Zhu, bpf, Andrea Arcangeli, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Valentin Rothberg, Andy Lutomirski, Will Drewry, Jann Horn, Aleksa Sarai, linux-kernel On Mon, Sep 21, 2020 at 3:30 AM Christian Brauner <christian.brauner@ubuntu.com> wrote: > This is missing some people so expanding the Cc a little. Make sure to > run scripts/get_maintainers.pl next time, in case you forgot. (Adding > Andy, Will, Jann, Aleksa at least.) > > Christian ok noted. Thanks! YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls 2020-09-21 5:35 [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls YiFei Zhu ` (3 preceding siblings ...) 2020-09-21 8:30 ` Christian Brauner @ 2020-09-21 13:51 ` Tycho Andersen 2020-09-21 15:27 ` YiFei Zhu 2020-09-21 19:16 ` Jann Horn ` (3 subsequent siblings) 8 siblings, 1 reply; 149+ messages in thread From: Tycho Andersen @ 2020-09-21 13:51 UTC (permalink / raw) To: YiFei Zhu Cc: containers, Andrea Arcangeli, Giuseppe Scrivano, Kees Cook, YiFei Zhu, Tobin Feldman-Fitzthum, Dimitrios Skarlatos, Valentin Rothberg, Hubertus Franke, Jack Chen, Josep Torrellas, bpf, Tianyin Xu On Mon, Sep 21, 2020 at 12:35:16AM -0500, YiFei Zhu wrote: > From: YiFei Zhu <yifeifz2@illinois.edu> > > This series adds a bitmap to cache seccomp filter results if the > result permits a syscall and is indepenent of syscall arguments. > This visibly decreases seccomp overhead for most common seccomp > filters with very little memory footprint. > > The overhead of running Seccomp filters has been part of some past > discussions [1][2][3]. Oftentimes, the filters have a large number > of instructions that check syscall numbers one by one and jump based > on that. Some users chain BPF filters which further enlarge the > overhead. A recent work [6] comprehensively measures the Seccomp > overhead and shows that the overhead is non-negligible and has a > non-trivial impact on application performance. > > We propose SECCOMP_CACHE, a cache-based solution to minimize the > Seccomp overhead. The basic idea is to cache the result of each > syscall check to save the subsequent overhead of executing the > filters. This is feasible, because the check in Seccomp is stateless. > The checking results of the same syscall ID and argument remains > the same. 
> > We observed some common filters, such as docker's [4] or > systemd's [5], will make most decisions based only on the syscall > numbers, and as past discussions considered, a bitmap where each bit > represents a syscall makes most sense for these filters. One problem with a kernel config setting is that it's for all tasks. While docker and systemd may make decisions based on syscall number, other applications may have more nuanced filters, and this cache would yield incorrect results. You could work around this by making this a filter flag instead; filter authors would generally know whether their filter results can be cached and probably be motivated to opt in if their users are complaining about slow syscall execution. Tycho ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls 2020-09-21 13:51 ` Tycho Andersen @ 2020-09-21 15:27 ` YiFei Zhu 2020-09-21 16:39 ` Tycho Andersen 0 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-09-21 15:27 UTC (permalink / raw) To: Tycho Andersen Cc: Linux Containers, Andrea Arcangeli, Giuseppe Scrivano, Kees Cook, YiFei Zhu, Tobin Feldman-Fitzthum, Dimitrios Skarlatos, Valentin Rothberg, Hubertus Franke, Jack Chen, Josep Torrellas, bpf, Tianyin Xu, Andy Lutomirski, Will Drewry, Jann Horn, Aleksa Sarai, linux-kernel On Mon, Sep 21, 2020 at 8:51 AM Tycho Andersen <tycho@tycho.pizza> wrote: > One problem with a kernel config setting is that it's for all tasks. > While docker and systemd may make decisions based on syscall number, > other applications may have more nuanced filters, and this cache would > yield incorrect results. > > You could work around this by making this a filter flag instead; > filter authors would generally know whether their filter results can > be cached and probably be motivated to opt in if their users are > complaining about slow syscall execution. > > Tycho Yielding incorrect results should not be possible. The purpose of the "emulator" (for lack of a better term) is to determine whether the filter reads any syscall arguments. A read from a syscall argument must go through the BPF_LD | BPF_ABS instruction, where the 32 bit multiuse field "k" is an offset to struct seccomp_data. struct seccomp_data contains four components [1]: syscall number, architecture number, instruction pointer at the time of syscall, and syscall arguments. The syscall number is enumerated by the emulator. The arch number is treated by the cache as 'if arch number is different from cached arch number, flush cache' (this is in seccomp_cache_check). 
The last two (ip and args) are treated exactly the same way in this patch: if the filter loads the arguments at all, the syscall is marked non-cacheable for any architecture number. The struct seccomp_data is the only external thing the filter may access. It is also cBPF so it cannot contain maps to store special states between runs. Therefore a seccomp filter is a pure function. If we know that, given some inputs (the syscall number and arch number), the function will not evaluate any other inputs before returning, then we can safely cache with just the inputs in question. As for the overhead, on my x86_64 with gcc 10.2.0, seccomp_cache_check compiles into: if (unlikely(syscall_nr < 0 || syscall_nr >= NR_syscalls)) return false; 0xffffffff8120fdb3 <+99>: movsxd rdx,DWORD PTR [r12] 0xffffffff8120fdb7 <+103>: cmp edx,0x1b7 0xffffffff8120fdbd <+109>: ja 0xffffffff8120fdf9 <__seccomp_filter+169> if (unlikely(thread_data->last_filter != sfilter || thread_data->last_arch != sd->arch)) { 0xffffffff8120fdbf <+111>: mov rdi,QWORD PTR [rbp-0xb8] 0xffffffff8120fdc6 <+118>: lea rsi,[rax+0x6f0] 0xffffffff8120fdcd <+125>: cmp rdi,QWORD PTR [rax+0x728] 0xffffffff8120fdd4 <+132>: jne 0xffffffff812101f0 <__seccomp_filter+1184> 0xffffffff8120fdda <+138>: mov ebx,DWORD PTR [r12+0x4] 0xffffffff8120fddf <+143>: cmp DWORD PTR [rax+0x730],ebx 0xffffffff8120fde5 <+149>: jne 0xffffffff812101f0 <__seccomp_filter+1184> return test_bit(syscall_nr, thread_data->syscall_ok); 0xffffffff8120fdeb <+155>: bt QWORD PTR [rax+0x6f0],rdx 0xffffffff8120fdf3 <+163>: jb 0xffffffff8120ffb7 <__seccomp_filter+615> [... 
unlikely path of cache flush omitted] and seccomp_cache_insert compiles into: if (unlikely(syscall_nr < 0 || syscall_nr >= NR_syscalls)) return; 0xffffffff8121021b <+1227>: movsxd rax,DWORD PTR [r12] 0xffffffff8121021f <+1231>: cmp eax,0x1b7 0xffffffff81210224 <+1236>: ja 0xffffffff8120ffb7 <__seccomp_filter+615> if (!test_bit(syscall_nr, sfilter->cache.syscall_ok)) return; 0xffffffff8121022a <+1242>: mov rbx,QWORD PTR [rbp-0xb8] 0xffffffff81210231 <+1249>: mov rdx,QWORD PTR gs:0x17000 0xffffffff8121023a <+1258>: bt QWORD PTR [rbx+0x108],rax 0xffffffff81210242 <+1266>: jae 0xffffffff8120ffb7 <__seccomp_filter+615> set_bit(syscall_nr, thread_data->syscall_ok); 0xffffffff81210248 <+1272>: lock bts QWORD PTR [rdx+0x6f0],rax 0xffffffff81210251 <+1281>: jmp 0xffffffff8120ffb7 <__seccomp_filter+615> In the circumstance of a non-cacheable syscall happening over and over, the code path would go through the syscall_nr bound check, then the filter flush check, then the test_bit, then another syscall_nr bound check and one more test_bit in seccomp_cache_insert. Considering that they are either stack variables, elements of current task_struct, or elements of the filter struct, I imagine they would likely be in the CPU data cache and not incur much overhead. The CPU is also free to branch predict and reorder memory accesses (there are no hardware memory barriers here) to further increase the efficiency, whereas a normal filter execution would be impaired by things like retpoline. If one were to add an additional flag for does-userspace-want-us-to-cache, it would still be a member of the filter struct. What would be loaded into the CPU data cache originally would still be loaded. Correct me if I'm wrong, but I don't think that check will reduce any significant overhead of the seccomp cache itself. That said, I have not profiled the exact overhead this patch would add to uncacheable syscalls; I can report back with numbers if you would like to see. 
Does that answer your concern? YiFei Zhu [1] https://elixir.bootlin.com/linux/latest/source/include/uapi/linux/seccomp.h#L60 ^ permalink raw reply [flat|nested] 149+ messages in thread
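The argument-independence check YiFei describes above — every read of struct seccomp_data goes through a BPF_LD | BPF_ABS instruction whose "k" field is the offset — can be sketched in userspace C. This is a deliberately conservative linear scan for illustration only: the actual patch walks the filter's control flow graph per (arch, nr) pair, and the struct and macro names below mirror but do not reproduce the kernel's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified cBPF opcode fields, mirroring <linux/bpf_common.h>. */
#define BPF_LD   0x00
#define BPF_W    0x00
#define BPF_ABS  0x20
#define BPF_CLASS(code) ((code) & 0x07)
#define BPF_MODE(code)  ((code) & 0xe0)

/* One cBPF instruction, as in <linux/filter.h>. */
struct sock_filter {
	uint16_t code;
	uint8_t  jt, jf;
	uint32_t k;	/* for LD|ABS: offset into struct seccomp_data */
};

/* Offsets within struct seccomp_data (uapi/linux/seccomp.h):
 * nr at 0, arch at 4, instruction_pointer at 8, args at 16+. */
#define SD_NR_OFF   0
#define SD_ARCH_OFF 4

/*
 * Conservative check: the filter is argument-independent if every
 * absolute load touches only the nr or arch members.  The real patch
 * follows the control flow graph instead; this linear scan merely
 * over-approximates the same idea.
 */
static bool filter_is_arg_independent(const struct sock_filter *insns,
				      size_t len)
{
	for (size_t i = 0; i < len; i++) {
		const struct sock_filter *fp = &insns[i];

		if (BPF_CLASS(fp->code) == BPF_LD &&
		    BPF_MODE(fp->code) == BPF_ABS &&
		    fp->k != SD_NR_OFF && fp->k != SD_ARCH_OFF)
			return false;	/* reads ip or a syscall argument */
	}
	return true;
}
```

A filter that only loads the syscall number passes the check; one that loads offset 16 (the first argument) is marked non-cacheable, matching the "safe because it can only fall back to running the BPF" reasoning above.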
* Re: [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls 2020-09-21 15:27 ` YiFei Zhu @ 2020-09-21 16:39 ` Tycho Andersen 2020-09-21 22:57 ` YiFei Zhu 0 siblings, 1 reply; 149+ messages in thread From: Tycho Andersen @ 2020-09-21 16:39 UTC (permalink / raw) To: YiFei Zhu Cc: Linux Containers, Andrea Arcangeli, Giuseppe Scrivano, Kees Cook, YiFei Zhu, Tobin Feldman-Fitzthum, Dimitrios Skarlatos, Valentin Rothberg, Hubertus Franke, Jack Chen, Josep Torrellas, bpf, Tianyin Xu, Andy Lutomirski, Will Drewry, Jann Horn, Aleksa Sarai, linux-kernel On Mon, Sep 21, 2020 at 10:27:56AM -0500, YiFei Zhu wrote: > On Mon, Sep 21, 2020 at 8:51 AM Tycho Andersen <tycho@tycho.pizza> wrote: > > One problem with a kernel config setting is that it's for all tasks. > > While docker and systemd may make decisions based on syscall number, > > other applications may have more nuanced filters, and this cache would > > yield incorrect results. > > > > You could work around this by making this a filter flag instead; > > filter authors would generally know whether their filter results can > > be cached and probably be motivated to opt in if their users are > > complaining about slow syscall execution. > > > > Tycho > > Yielding incorrect results should not be possible. The purpose of the > "emulator" (for lack of a better term) is to determine whether the > filter reads any syscall arguments. A read from a syscall argument > must go through the BPF_LD | BPF_ABS instruction, where the 32 bit > multiuse field "k" is an offset to struct seccomp_data. I see, I missed this somehow. So is there a reason to hide this behind a config option? Isn't it just always better? Tycho ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls 2020-09-21 16:39 ` Tycho Andersen @ 2020-09-21 22:57 ` YiFei Zhu 0 siblings, 0 replies; 149+ messages in thread From: YiFei Zhu @ 2020-09-21 22:57 UTC (permalink / raw) To: Tycho Andersen Cc: Linux Containers, Andrea Arcangeli, Giuseppe Scrivano, Kees Cook, YiFei Zhu, Tobin Feldman-Fitzthum, Dimitrios Skarlatos, Valentin Rothberg, Hubertus Franke, Jack Chen, Josep Torrellas, bpf, Tianyin Xu, Andy Lutomirski, Will Drewry, Jann Horn, Aleksa Sarai, kernel list On Mon, Sep 21, 2020 at 11:39 AM Tycho Andersen <tycho@tycho.pizza> wrote: > I see, I missed this somehow. So is there a reason to hide this behind > a config option? Isn't it just always better? > > Tycho You have a good point, though I think keeping a config would allow people to "test the differences" in the unlikely case that some issue occurs. Jann pointed out that it should be on by default so I'll do that. YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls 2020-09-21 5:35 [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls YiFei Zhu ` (4 preceding siblings ...) 2020-09-21 13:51 ` Tycho Andersen @ 2020-09-21 19:16 ` Jann Horn [not found] ` <OF8837FC1A.5C0D4D64-ON852585EA.006B677F-852585EA.006BA663@notes.na.collabserv.com> 2020-09-23 19:26 ` Kees Cook ` (2 subsequent siblings) 8 siblings, 1 reply; 149+ messages in thread From: Jann Horn @ 2020-09-21 19:16 UTC (permalink / raw) To: YiFei Zhu Cc: Linux Containers, YiFei Zhu, bpf, Andrea Arcangeli, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Valentin Rothberg, Andy Lutomirski, Will Drewry, Aleksa Sarai, kernel list On Mon, Sep 21, 2020 at 7:35 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > This series adds a bitmap to cache seccomp filter results if the > result permits a syscall and is independent of syscall arguments. > This visibly decreases seccomp overhead for most common seccomp > filters with very little memory footprint. It would be really nice if, based on this, we could have a new entry in procfs that has one line per entry in each syscall table. Maybe something that looks vaguely like: X86_64 0 (read): ALLOW X86_64 1 (write): ALLOW X86_64 2 (open): ERRNO -1 X86_64 3 (close): ALLOW X86_64 4 (stat): <argument-dependent> [...] I386 0 (restart_syscall): ALLOW I386 1 (exit): ALLOW I386 2 (fork): KILL [...] This would be useful both for inspectability (at the moment it's pretty hard to figure out what seccomp rules really apply to a given task) and for testing (so that we could easily write unit tests to verify that the bitmap calculation works as expected). 
But if you don't want to implement that right now, we can do that at a later point - while it would be nice for making it easier to write tests for this functionality, I don't see it as a blocker. > The overhead of running Seccomp filters has been part of some past > discussions [1][2][3]. Oftentimes, the filters have a large number > of instructions that check syscall numbers one by one and jump based > on that. Some users chain BPF filters which further enlarge the > overhead. A recent work [6] comprehensively measures the Seccomp > overhead and shows that the overhead is non-negligible and has a > non-trivial impact on application performance. > > We propose SECCOMP_CACHE, a cache-based solution to minimize the > Seccomp overhead. The basic idea is to cache the result of each > syscall check to save the subsequent overhead of executing the > filters. This is feasible, because the check in Seccomp is stateless. > The checking results of the same syscall ID and argument remains > the same. > > We observed some common filters, such as docker's [4] or > systemd's [5], will make most decisions based only on the syscall > numbers, and as past discussions considered, a bitmap where each bit > represents a syscall makes most sense for these filters. [...] > Statically analyzing BPF bytecode to see if each syscall is going to > always land in allow or reject is more of a rabbit hole, especially > there is no current in-kernel infrastructure to enumerate all the > possible architecture numbers for a given machine. You could add that though. Or if you think that that's too much work, you could just do it for x86 and arm64 and then use a Kconfig dependency to limit this to those architectures for now. > So rather than > doing that, we propose to cache the results after the BPF filters are > run. Please don't add extra complexity just to work around a limitation in existing code if you could instead remove that limitation in existing code. 
Otherwise, code will become unnecessarily hard to understand and inefficient. You could let struct seccomp_filter contain three bitmasks - one for the "native" architecture and up to two for "compat" architectures (gated on some Kconfig flag). alpha has 1 architecture number, arc has 1 (per build config), arm has 1, arm64 has 2, c6x has 1 (per build config), csky has 1, h8300 has 1, hexagon has 1, ia64 has 1, m68k has 1, microblaze has 1, mips has 3 (per build config), nds32 has 1 (per build config), nios2 has 1, openrisc has 1, parisc has 2, powerpc has 2 (per build config), riscv has 1 (per build config), s390 has 2, sh has 1 (per build config), sparc has 2, x86 has 2, xtensa has 1. > And since there are filters like docker's who will check > arguments of some syscalls, but not all or none of the syscalls, when > a filter is loaded we analyze it to find whether each syscall is > cacheable (does not access syscall argument or instruction pointer) by > following its control flow graph, and store the result for each filter > in a bitmap. Changes to architecture number or the filter are expected > to be rare and simply cause the cache to be cleared. This solution > shall be fully transparent to userspace. Caching whether a given syscall number has fixed per-architecture results across all architectures is a pretty gross hack, please don't. > Ongoing work is to further support arguments with fast hash table > lookups. We are investigating the performance of doing so [6], and how > to best integrate with the existing seccomp infrastructure. ^ permalink raw reply [flat|nested] 149+ messages in thread
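The layout Jann suggests — per-filter allow bitmaps keyed by the small, fixed set of audit arch numbers a given kernel build can see, instead of one shared bitmap that is flushed whenever the arch changes — might look roughly like this. The struct name, the two-entry arch array, and the syscall-count constant are illustrative assumptions, not the eventual kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CACHE_NR_ARCHES   2	/* e.g. native + one compat; illustrative */
#define CACHE_NR_SYSCALLS 440	/* illustrative, not any real NR_syscalls */
#define BITS_PER_LONG (8 * sizeof(unsigned long))
#define BITMAP_LONGS(n) (((n) + BITS_PER_LONG - 1) / BITS_PER_LONG)

/* Hypothetical per-filter cache: one allow-bitmap per audit arch
 * number this build supports, so nothing needs flushing when a task
 * alternates between native and compat syscalls. */
struct seccomp_cache {
	uint32_t      arch[CACHE_NR_ARCHES];
	unsigned long allow[CACHE_NR_ARCHES][BITMAP_LONGS(CACHE_NR_SYSCALLS)];
};

/* Fast path: pick the bitmap matching sd->arch and test one bit. */
static bool cache_test(const struct seccomp_cache *c, uint32_t arch, int nr)
{
	if (nr < 0 || nr >= CACHE_NR_SYSCALLS)
		return false;
	for (int i = 0; i < CACHE_NR_ARCHES; i++)
		if (c->arch[i] == arch)
			return (c->allow[i][nr / BITS_PER_LONG] >>
				(nr % BITS_PER_LONG)) & 1;
	return false;	/* unknown arch: fall back to running the BPF */
}

/* Slow path records an argument-independent allow result. */
static void cache_set(struct seccomp_cache *c, uint32_t arch, int nr)
{
	if (nr < 0 || nr >= CACHE_NR_SYSCALLS)
		return;
	for (int i = 0; i < CACHE_NR_ARCHES; i++)
		if (c->arch[i] == arch)
			c->allow[i][nr / BITS_PER_LONG] |=
				1UL << (nr % BITS_PER_LONG);
}
```

Because a miss only means "run the BPF filter as before", the lookup can never weaken the filter; it can only skip work the filter would have repeated.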
[parent not found: <OF8837FC1A.5C0D4D64-ON852585EA.006B677F-852585EA.006BA663@notes.na.collabserv.com>]
* Re: [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls [not found] ` <OF8837FC1A.5C0D4D64-ON852585EA.006B677F-852585EA.006BA663@notes.na.collabserv.com> @ 2020-09-21 19:45 ` Jann Horn 0 siblings, 0 replies; 149+ messages in thread From: Jann Horn @ 2020-09-21 19:45 UTC (permalink / raw) To: Hubertus Franke Cc: Andrea Arcangeli, bpf, Linux Containers, Aleksa Sarai, Dimitrios Skarlatos, Giuseppe Scrivano, Jack Chen, Kees Cook, kernel list, Andy Lutomirski, Tobin Feldman-Fitzthum, Josep Torrellas, Tianyin Xu, Valentin Rothberg, Will Drewry, YiFei Zhu, YiFei Zhu On Mon, Sep 21, 2020 at 9:35 PM Hubertus Franke <frankeh@us.ibm.com> wrote: > I suggest we first bring it down to the minimal features we want and successively build the functions as these ideas evolve. > We asked YiFei to prepare a minimal set that brings home the basic features. Might not be 100% optimal but having the hooks, the basic cache in place and getting a good benefit should be a good starting point > to get this integrated into a linux kernel and then enable a larger experimentation. > Does that make sense to approach it from that point ? Sure. As I said, I don't think that the procfs part is a blocker - if YiFei doesn't want to implement it now, I don't think it's necessary. (But it would make it possible to write more precise tests.) By the way: Please don't top-post on mailing lists - instead, quote specific parts of a message and reply below those quotes. Also, don't send HTML mail to kernel mailing lists, because they will reject it. ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls 2020-09-21 5:35 [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls YiFei Zhu ` (5 preceding siblings ...) 2020-09-21 19:16 ` Jann Horn @ 2020-09-23 19:26 ` Kees Cook 2020-09-23 22:54 ` YiFei Zhu 2020-09-24 12:06 ` [PATCH seccomp 0/6] " YiFei Zhu 2020-09-30 15:19 ` [PATCH v3 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu 8 siblings, 1 reply; 149+ messages in thread From: Kees Cook @ 2020-09-23 19:26 UTC (permalink / raw) To: YiFei Zhu Cc: containers, YiFei Zhu, bpf, Andrea Arcangeli, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Valentin Rothberg On Mon, Sep 21, 2020 at 12:35:16AM -0500, YiFei Zhu wrote: > In the past Kees proposed [2] to have an "add this syscall to the > reject bitmask". It is indeed much easier to securely make a reject > accelerator to pre-filter syscalls before passing to the BPF > filters, considering it could only strengthen the security provided > by the filter. However, ultimately, filter rejections are an > exceptional / rare case. Here, instead of accelerating what is > rejected, we accelerate what is allowed. In order not to compromise > the security rules the BPF filters defined, any accept-side > accelerator must complement the BPF filters rather than replacing them. Did you see the RFC series for this? https://lore.kernel.org/lkml/20200616074934.1600036-1-keescook@chromium.org/ > Without cache, seccomp_benchmark: > Current BPF sysctl settings: > net.core.bpf_jit_enable = 1 > net.core.bpf_jit_harden = 0 > Calibrating sample size for 15 seconds worth of syscalls ... > Benchmarking 23486415 syscalls... 
> 16.079642020 - 1.013345439 = 15066296581 (15.1s) > getpid native: 641 ns > 32.080237410 - 16.080763500 = 15999473910 (16.0s) > getpid RET_ALLOW 1 filter: 681 ns > 48.609461618 - 32.081296173 = 16528165445 (16.5s) > getpid RET_ALLOW 2 filters: 703 ns > Estimated total seccomp overhead for 1 filter: 40 ns > Estimated total seccomp overhead for 2 filters: 62 ns > Estimated seccomp per-filter overhead: 22 ns > Estimated seccomp entry overhead: 18 ns > > With cache: > Current BPF sysctl settings: > net.core.bpf_jit_enable = 1 > net.core.bpf_jit_harden = 0 > Calibrating sample size for 15 seconds worth of syscalls ... > Benchmarking 23486415 syscalls... > 16.059512499 - 1.014108434 = 15045404065 (15.0s) > getpid native: 640 ns > 31.651075934 - 16.060637323 = 15590438611 (15.6s) > getpid RET_ALLOW 1 filter: 663 ns > 47.367316169 - 31.652302661 = 15715013508 (15.7s) > getpid RET_ALLOW 2 filters: 669 ns > Estimated total seccomp overhead for 1 filter: 23 ns > Estimated total seccomp overhead for 2 filters: 29 ns > Estimated seccomp per-filter overhead: 6 ns > Estimated seccomp entry overhead: 17 ns > > Depending on the run estimated seccomp overhead for 2 filters can be > less than seccomp overhead for 1 filter, resulting in underflow to > estimated seccomp per-filter overhead: > Estimated total seccomp overhead for 1 filter: 27 ns > Estimated total seccomp overhead for 2 filters: 21 ns > Estimated seccomp per-filter overhead: 18446744073709551610 ns > Estimated seccomp entry overhead: 33 ns Which also includes updated benchmarking: https://lore.kernel.org/lkml/20200616074934.1600036-6-keescook@chromium.org/ -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
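The impossible-looking per-filter estimate quoted above is plain unsigned wraparound: the benchmark subtracts the two overhead estimates in 64-bit unsigned arithmetic, so 21 - 27 comes out as 2^64 - 6 = 18446744073709551610. A minimal reproduction (the function name is illustrative, not the benchmark's):

```c
#include <assert.h>
#include <stdint.h>

/* The benchmark computes per-filter overhead as an unsigned
 * difference; when the 2-filter run happens to measure faster than
 * the 1-filter run, the difference wraps modulo 2^64 instead of
 * going negative. */
static uint64_t per_filter_overhead(uint64_t two_filters_ns,
				    uint64_t one_filter_ns)
{
	return two_filters_ns - one_filter_ns; /* wraps when "negative" */
}
```

Clamping to zero (or printing a signed difference) in the benchmark script would avoid reporting the wrapped value.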
* Re: [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls 2020-09-23 19:26 ` Kees Cook @ 2020-09-23 22:54 ` YiFei Zhu 2020-09-24 6:52 ` Kees Cook 0 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-09-23 22:54 UTC (permalink / raw) To: Kees Cook Cc: Linux Containers, YiFei Zhu, bpf, Andrea Arcangeli, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Valentin Rothberg On Wed, Sep 23, 2020 at 2:26 PM Kees Cook <keescook@chromium.org> wrote: > Did you see the RFC series for this? > > https://lore.kernel.org/lkml/20200616074934.1600036-1-keescook@chromium.org/ > [...] > Which also includes updated benchmarking: > > https://lore.kernel.org/lkml/20200616074934.1600036-6-keescook@chromium.org/ Nice. I was not aware of that series. Looking at it, it seems that our reasoning for checking arch and nr only, and verifying whether the filter accesses anything else, is the same. However, the approach that RFC used was some page table dark magic, and it has been concluded that an emulator is superior. Was there a separate patch series with the emulator? If not, would you mind me cherry-picking some of your changes in that series? Also, I see that BPF_AND is said to be used in the discussion of the linked series. I think it wouldn't hurt to emulate a few BPF_ALU so I'll add that. YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls 2020-09-23 22:54 ` YiFei Zhu @ 2020-09-24 6:52 ` Kees Cook 0 siblings, 0 replies; 149+ messages in thread From: Kees Cook @ 2020-09-24 6:52 UTC (permalink / raw) To: YiFei Zhu Cc: Linux Containers, YiFei Zhu, bpf, Andrea Arcangeli, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Valentin Rothberg On Wed, Sep 23, 2020 at 05:54:51PM -0500, YiFei Zhu wrote: > On Wed, Sep 23, 2020 at 2:26 PM Kees Cook <keescook@chromium.org> wrote: > > Did you see the RFC series for this? > > > > https://lore.kernel.org/lkml/20200616074934.1600036-1-keescook@chromium.org/ > > [...] > > Which also includes updated benchmarking: > > > > https://lore.kernel.org/lkml/20200616074934.1600036-6-keescook@chromium.org/ > > Nice. I was not aware of that series. Looking at it, it seems that our > reasoning for checking arch and nr only, and verify if the filter > accesses anything else, is the same. However, the approach in that RFC > used was some page table dark magic, and it has been concluded that an > emulator is superior. Was there a separate patch series with emulator? > If not, would you mind me cherry-picking some of your changes in that > series? I've sent that series refreshed with Jann's emulator now[1]. (Which I see you've replied to as well, but I figured I'd just link these threads for any future archaeology. ;) > Also, I see that BPF_AND is said to be used in the discussion of the > linked series. I think it wouldn't hurt to emulate a few BPF_ALU so > I'll add that. If you could add ALU|AND, that would get us complete coverage for libseccomp and Chrome. I don't want the emulator to get any more complex than that, as I view it as a fairly high-risk part. As you can see, I tried really hard to _not_ use an emulator in the RFC. 
;) [1] https://lore.kernel.org/lkml/20200923232923.3142503-1-keescook@chromium.org/ -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
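Supporting ALU|AND in the emulator, as requested here, amounts to tracking whether the accumulator still holds a known constant at each step: an AND with an immediate keeps it known, and anything unsupported makes the emulator give up (the safe answer, since "don't cache" just means the BPF runs as before). A hedged sketch — the state layout and function names are assumptions, not the merged code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* cBPF opcode pieces for BPF_ALU | BPF_AND | BPF_K. */
#define BPF_ALU 0x04
#define BPF_AND 0x50
#define BPF_K   0x00

/* Tracked accumulator during filter emulation; names illustrative. */
struct emu_state {
	uint32_t acc;
	bool     acc_known;	/* is acc a known constant at this point? */
};

/*
 * Handle one ALU instruction.  Returns true when emulation can
 * continue with a known state; false means "give up, don't cache",
 * which is always safe because the real filter still runs.
 */
static bool emu_alu_step(struct emu_state *st, uint16_t code, uint32_t k)
{
	if (code == (BPF_ALU | BPF_AND | BPF_K)) {
		if (!st->acc_known)
			return false;
		st->acc &= k;	/* constant in, constant out */
		return true;
	}
	return false;		/* any other ALU op: bail out */
}
```

This mirrors the "no more complex than ALU|AND" constraint Kees states: every other opcode falls through to the conservative bail-out path.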
* [PATCH seccomp 0/6] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls 2020-09-21 5:35 [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls YiFei Zhu ` (6 preceding siblings ...) 2020-09-23 19:26 ` Kees Cook @ 2020-09-24 12:06 ` YiFei Zhu 2020-09-24 12:06 ` [PATCH seccomp 1/6] seccomp: Move config option SECCOMP to arch/Kconfig YiFei Zhu ` (6 more replies) 2020-09-30 15:19 ` [PATCH v3 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu 8 siblings, 7 replies; 149+ messages in thread From: YiFei Zhu @ 2020-09-24 12:06 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: YiFei Zhu <yifeifz2@illinois.edu> Alternative: https://lore.kernel.org/lkml/20200923232923.3142503-1-keescook@chromium.org/T/ Major differences from the linked alternative by Kees: * No x32 special-case handling -- not worth the complexity * No caching of denylist -- not worth the complexity * No seccomp arch pinning -- I think this is an independent feature * The bitmaps are part of the filters rather than the task. * Architectures supported by default through arch number array, except for MIPS with its sparse syscall numbers. * Configurable per-build for future different cache modes. This series adds a bitmap to cache seccomp filter results if the result permits a syscall and is independent of syscall arguments. This visibly decreases seccomp overhead for most common seccomp filters with very little memory footprint. The overhead of running Seccomp filters has been part of some past discussions [1][2][3]. Oftentimes, the filters have a large number of instructions that check syscall numbers one by one and jump based on that. 
Some users chain BPF filters which further enlarge the overhead. A recent work [6] comprehensively measures the Seccomp overhead and shows that the overhead is non-negligible and has a non-trivial impact on application performance. We observed that some common filters, such as docker's [4] or systemd's [5], will make most decisions based only on the syscall numbers, and as past discussions considered, a bitmap where each bit represents a syscall makes most sense for these filters. In order to build this bitmap at filter attach time, each filter is emulated for every syscall (under each possible architecture), and checked for any accesses of struct seccomp_data other than the "arch" and "nr" (syscall) members. If only "arch" and "nr" are examined, and the program returns allow, then we can be sure that the filter must return allow independent of syscall arguments. When it is concluded that an allow must occur for the given architecture and syscall pair, seccomp will immediately allow the syscall, bypassing further BPF execution. Ongoing work is to further support arguments with fast hash table lookups. We are investigating the performance of doing so [6], and how to best integrate with the existing seccomp infrastructure. Some benchmarks are performed with results in patch 5, copied below: Current BPF sysctl settings: net.core.bpf_jit_enable = 1 net.core.bpf_jit_harden = 0 Benchmarking 100000000 syscalls... 
63.896255358 - 0.008504529 = 63887750829 (63.9s) getpid native: 638 ns 130.383312423 - 63.897315189 = 66485997234 (66.5s) getpid RET_ALLOW 1 filter (bitmap): 664 ns 196.789080421 - 130.384414983 = 66404665438 (66.4s) getpid RET_ALLOW 2 filters (bitmap): 664 ns 268.844643304 - 196.790234168 = 72054409136 (72.1s) getpid RET_ALLOW 3 filters (full): 720 ns 342.627472515 - 268.845799103 = 73781673412 (73.8s) getpid RET_ALLOW 4 filters (full): 737 ns Estimated total seccomp overhead for 1 bitmapped filter: 26 ns Estimated total seccomp overhead for 2 bitmapped filters: 26 ns Estimated total seccomp overhead for 3 full filters: 82 ns Estimated total seccomp overhead for 4 full filters: 99 ns Estimated seccomp entry overhead: 26 ns Estimated seccomp per-filter overhead (last 2 diff): 17 ns Estimated seccomp per-filter overhead (filters / 4): 18 ns Expectations: native ≤ 1 bitmap (638 ≤ 664): ✔️ native ≤ 1 filter (638 ≤ 720): ✔️ per-filter (last 2 diff) ≈ per-filter (filters / 4) (17 ≈ 18): ✔️ 1 bitmapped ≈ 2 bitmapped (26 ≈ 26): ✔️ entry ≈ 1 bitmapped (26 ≈ 26): ✔️ entry ≈ 2 bitmapped (26 ≈ 26): ✔️ native + entry + (per filter * 4) ≈ 4 filters total (732 ≈ 737): ✔️ RFC -> v1: * Config made on by default across all arches that could support it. * Added arch numbers array and emulate filter for each arch number, and have a per-arch bitmap. * Massively simplified the emulator so it would only support the common instructions in Kees's list. * Fixed inheriting bitmap across filters (filter->prev is always NULL during prepare). * Stole the selftest from Kees. * Added a /proc/pid/seccomp_cache by Jann's suggestion. Patch 1 moves the SECCOMP Kcomfig option to arch/Kconfig. Patch 2 adds a syscall_arches array so the emulator can enumerate it. Patch 3 implements the emulator that finds if a filter must return allow, Patch 4 implements the test_bit against the bitmaps. Patch 5 updates the selftest to better show the new semantics. Patch 6 implements /proc/pid/seccomp_cache. 
[1] https://lore.kernel.org/linux-security-module/c22a6c3cefc2412cad00ae14c1371711@huawei.com/T/ [2] https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/T/ [3] https://github.com/seccomp/libseccomp/issues/116 [4] https://github.com/moby/moby/blob/ae0ef82b90356ac613f329a8ef5ee42ca923417d/profiles/seccomp/default.json [5] https://github.com/systemd/systemd/blob/6743a1caf4037f03dc51a1277855018e4ab61957/src/shared/seccomp-util.c#L270 [6] Draco: Architectural and Operating System Support for System Call Security https://tianyin.github.io/pub/draco.pdf, MICRO-53, Oct. 2020 Kees Cook (1): selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu (5): seccomp: Move config option SECCOMP to arch/Kconfig asm/syscall.h: Add syscall_arches[] array seccomp/cache: Add "emulator" to check if filter is arg-dependent seccomp/cache: Lookup syscall allowlist for fast path seccomp/cache: Report cache data through /proc/pid/seccomp_cache arch/Kconfig | 56 ++++ arch/alpha/include/asm/syscall.h | 4 + arch/arc/include/asm/syscall.h | 24 +- arch/arm/Kconfig | 15 +- arch/arm/include/asm/syscall.h | 4 + arch/arm64/Kconfig | 13 - arch/arm64/include/asm/syscall.h | 4 + arch/c6x/include/asm/syscall.h | 13 +- arch/csky/Kconfig | 13 - arch/csky/include/asm/syscall.h | 4 + arch/h8300/include/asm/syscall.h | 4 + arch/hexagon/include/asm/syscall.h | 4 + arch/ia64/include/asm/syscall.h | 4 + arch/m68k/include/asm/syscall.h | 4 + arch/microblaze/Kconfig | 18 +- arch/microblaze/include/asm/syscall.h | 4 + arch/mips/Kconfig | 17 -- arch/mips/include/asm/syscall.h | 16 ++ arch/nds32/include/asm/syscall.h | 13 +- arch/nios2/include/asm/syscall.h | 4 + arch/openrisc/include/asm/syscall.h | 4 + arch/parisc/Kconfig | 16 -- arch/parisc/include/asm/syscall.h | 7 + arch/powerpc/Kconfig | 17 -- arch/powerpc/include/asm/syscall.h | 14 + arch/riscv/Kconfig | 13 - arch/riscv/include/asm/syscall.h | 14 +- arch/s390/Kconfig | 17 -- arch/s390/include/asm/syscall.h | 7 + arch/sh/Kconfig | 16 -- 
arch/sh/include/asm/syscall_32.h | 17 +- arch/sparc/Kconfig | 18 +- arch/sparc/include/asm/syscall.h | 9 + arch/um/Kconfig | 16 -- arch/x86/Kconfig | 16 -- arch/x86/include/asm/syscall.h | 11 + arch/x86/um/asm/syscall.h | 14 +- arch/xtensa/Kconfig | 14 - arch/xtensa/include/asm/syscall.h | 4 + fs/proc/base.c | 7 +- include/linux/seccomp.h | 5 + kernel/seccomp.c | 259 +++++++++++++++++- .../selftests/seccomp/seccomp_benchmark.c | 151 ++++++++-- tools/testing/selftests/seccomp/settings | 2 +- 44 files changed, 641 insertions(+), 265 deletions(-) -- 2.28.0 ^ permalink raw reply [flat|nested] 149+ messages in thread
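The attach-time pass this cover letter describes — emulate the filter once per (arch, nr) pair and record provable constant-allow results in per-arch bitmaps — can be sketched as follows. The constants, the struct, and the emulate callback are illustrative stand-ins, with a toy emulator in place of the real cBPF walker:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NR_ARCHES   2	/* e.g. native + compat; illustrative */
#define NR_SYSCALLS 64	/* shrunk for the sketch */

/* Signature of the emulator: true iff the filter provably returns
 * SECCOMP_RET_ALLOW for this (arch, nr) while reading only the nr
 * and arch members of seccomp_data. */
typedef bool (*emulate_fn)(uint32_t arch, int nr);

/* One bit per syscall, one row per supported arch. */
struct filter_bitmap {
	uint64_t allow[NR_ARCHES][NR_SYSCALLS / 64];
};

/* Attach-time pass: enumerate every (arch, nr) pair once so the
 * fast path never reruns the BPF for constant-allow results. */
static void build_bitmap(struct filter_bitmap *bm,
			 const uint32_t *arches, emulate_fn emulate)
{
	memset(bm, 0, sizeof(*bm));
	for (int a = 0; a < NR_ARCHES; a++)
		for (int nr = 0; nr < NR_SYSCALLS; nr++)
			if (emulate(arches[a], nr))
				bm->allow[a][nr / 64] |= 1ULL << (nr % 64);
}

/* Toy stand-in for the real cBPF emulator: pretend the filter
 * constant-allows even-numbered syscalls on arch 1 only. */
static bool toy_emulate(uint32_t arch, int nr)
{
	return arch == 1 && nr % 2 == 0;
}
```

Because the pass runs once at attach time and filters are immutable afterwards, the bitmap never needs invalidation; a cleared bit simply means the syscall takes the normal BPF path.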
* [PATCH seccomp 1/6] seccomp: Move config option SECCOMP to arch/Kconfig 2020-09-24 12:06 ` [PATCH seccomp 0/6] " YiFei Zhu @ 2020-09-24 12:06 ` YiFei Zhu 2020-09-24 12:06 ` YiFei Zhu 2020-09-24 12:06 ` [PATCH seccomp 2/6] asm/syscall.h: Add syscall_arches[] array YiFei Zhu ` (5 subsequent siblings) 6 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-09-24 12:06 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: YiFei Zhu <yifeifz2@illinois.edu> In order to make adding configurable features into seccomp easier, it's better to have the options in one single location, especially considering that the bulk of seccomp code is arch-independent. A quick look also shows that many SECCOMP descriptions are outdated; they talk about /proc rather than prctl. Architectures arm, arm64, csky, riscv, sh, and xtensa did not have SECCOMP on by default prior to this; as a result of moving the config option and keeping it default on, SECCOMP is now on by default for them. Architectures microblaze, mips, powerpc, s390, sh, and sparc have an outdated depend on PROC_FS and this dependency is removed in this change. 
Suggested-by: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/lkml/CAG48ez1YWz9cnp08UZgeieYRhHdqh-ch7aNwc4JRBnGyrmgfMg@mail.gmail.com/
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
---
 arch/Kconfig            | 21 +++++++++++++++++++++
 arch/arm/Kconfig        | 15 +--------------
 arch/arm64/Kconfig      | 13 -------------
 arch/csky/Kconfig       | 13 -------------
 arch/microblaze/Kconfig | 18 +-----------------
 arch/mips/Kconfig       | 17 -----------------
 arch/parisc/Kconfig     | 16 ----------------
 arch/powerpc/Kconfig    | 17 -----------------
 arch/riscv/Kconfig      | 13 -------------
 arch/s390/Kconfig       | 17 -----------------
 arch/sh/Kconfig         | 16 ----------------
 arch/sparc/Kconfig      | 18 +-----------------
 arch/um/Kconfig         | 16 ----------------
 arch/x86/Kconfig        | 16 ----------------
 arch/xtensa/Kconfig     | 14 --------------
 15 files changed, 24 insertions(+), 216 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index af14a567b493..6dfc5673215d 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -444,8 +444,12 @@ config ARCH_WANT_OLD_COMPAT_IPC
 	select ARCH_WANT_COMPAT_IPC_PARSE_VERSION
 	bool
 
+config HAVE_ARCH_SECCOMP
+	bool
+
 config HAVE_ARCH_SECCOMP_FILTER
 	bool
+	select HAVE_ARCH_SECCOMP
 	help
 	  An arch should select this symbol if it provides all of these things:
 	  - syscall_get_arch()
@@ -458,6 +462,23 @@ config HAVE_ARCH_SECCOMP_FILTER
 	    results in the system call being skipped immediately.
 	  - seccomp syscall wired up
 
+config SECCOMP
+	def_bool y
+	depends on HAVE_ARCH_SECCOMP
+	prompt "Enable seccomp to safely compute untrusted bytecode"
+	help
+	  This kernel feature is useful for number crunching applications
+	  that may need to compute untrusted bytecode during their
+	  execution. By using pipes or other transports made available to
+	  the process as file descriptors supporting the read/write
+	  syscalls, it's possible to isolate those applications in
+	  their own address space using seccomp. Once seccomp is
+	  enabled via prctl(PR_SET_SECCOMP), it cannot be disabled
+	  and the task is only allowed to execute a few safe syscalls
+	  defined by each seccomp mode.
+
+	  If unsure, say Y. Only embedded should say N here.
+
 config SECCOMP_FILTER
 	def_bool y
 	depends on HAVE_ARCH_SECCOMP_FILTER && SECCOMP && NET
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index e00d94b16658..e26c19a16284 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -67,6 +67,7 @@ config ARM
 	select HAVE_ARCH_JUMP_LABEL if !XIP_KERNEL && !CPU_ENDIAN_BE32 && MMU
 	select HAVE_ARCH_KGDB if !CPU_ENDIAN_BE32 && MMU
 	select HAVE_ARCH_MMAP_RND_BITS if MMU
+	select HAVE_ARCH_SECCOMP
 	select HAVE_ARCH_SECCOMP_FILTER if AEABI && !OABI_COMPAT
 	select HAVE_ARCH_THREAD_STRUCT_WHITELIST
 	select HAVE_ARCH_TRACEHOOK
@@ -1617,20 +1618,6 @@ config UACCESS_WITH_MEMCPY
 	  However, if the CPU data cache is using a write-allocate mode, this
 	  option is unlikely to provide any performance gain.
 
-config SECCOMP
-	bool
-	prompt "Enable seccomp to safely compute untrusted bytecode"
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via prctl(PR_SET_SECCOMP), it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
-
 config PARAVIRT
 	bool "Enable paravirtualization code"
 	help
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 6d232837cbee..98c4e34cbec1 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1033,19 +1033,6 @@ config ARCH_ENABLE_SPLIT_PMD_PTLOCK
 config CC_HAVE_SHADOW_CALL_STACK
 	def_bool $(cc-option, -fsanitize=shadow-call-stack -ffixed-x18)
 
-config SECCOMP
-	bool "Enable seccomp to safely compute untrusted bytecode"
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via prctl(PR_SET_SECCOMP), it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
-
 config PARAVIRT
 	bool "Enable paravirtualization code"
 	help
diff --git a/arch/csky/Kconfig b/arch/csky/Kconfig
index 3d5afb5f5685..7f424c85772c 100644
--- a/arch/csky/Kconfig
+++ b/arch/csky/Kconfig
@@ -309,16 +309,3 @@ endmenu
 source "arch/csky/Kconfig.platforms"
 
 source "kernel/Kconfig.hz"
-
-config SECCOMP
-	bool "Enable seccomp to safely compute untrusted bytecode"
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via prctl(PR_SET_SECCOMP), it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
diff --git a/arch/microblaze/Kconfig b/arch/microblaze/Kconfig
index d262ac0c8714..37bd6a5f38fb 100644
--- a/arch/microblaze/Kconfig
+++ b/arch/microblaze/Kconfig
@@ -26,6 +26,7 @@ config MICROBLAZE
 	select GENERIC_SCHED_CLOCK
 	select HAVE_ARCH_HASH
 	select HAVE_ARCH_KGDB
+	select HAVE_ARCH_SECCOMP
 	select HAVE_DEBUG_KMEMLEAK
 	select HAVE_DMA_CONTIGUOUS
 	select HAVE_DYNAMIC_FTRACE
@@ -120,23 +121,6 @@ config CMDLINE_FORCE
 	  Set this to have arguments from the default kernel command string
 	  override those passed by the boot loader.
 
-config SECCOMP
-	bool "Enable seccomp to safely compute untrusted bytecode"
-	depends on PROC_FS
-	default y
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via /proc/<pid>/seccomp, it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
-
-	  If unsure, say Y. Only embedded should say N here.
-
 endmenu
 
 menu "Kernel features"
diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index c95fa3a2484c..5f88a8fc11fc 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -3004,23 +3004,6 @@ config PHYSICAL_START
 	  specified in the "crashkernel=YM@XM" command line boot parameter
 	  passed to the panic-ed kernel).
 
-config SECCOMP
-	bool "Enable seccomp to safely compute untrusted bytecode"
-	depends on PROC_FS
-	default y
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via /proc/<pid>/seccomp, it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
-
-	  If unsure, say Y. Only embedded should say N here.
-
 config MIPS_O32_FP64_SUPPORT
 	bool "Support for O32 binaries using 64-bit FP" if !CPU_MIPSR6
 	depends on 32BIT || MIPS32_O32
diff --git a/arch/parisc/Kconfig b/arch/parisc/Kconfig
index 3b0f53dd70bc..cd4afe1e7a6c 100644
--- a/arch/parisc/Kconfig
+++ b/arch/parisc/Kconfig
@@ -378,19 +378,3 @@ endmenu
 
 source "drivers/parisc/Kconfig"
-
-config SECCOMP
-	def_bool y
-	prompt "Enable seccomp to safely compute untrusted bytecode"
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via prctl(PR_SET_SECCOMP), it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
-
-	  If unsure, say Y. Only embedded should say N here.
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 1f48bbfb3ce9..136fe860caef 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -934,23 +934,6 @@ config ARCH_WANTS_FREEZER_CONTROL
 
 source "kernel/power/Kconfig"
 
-config SECCOMP
-	bool "Enable seccomp to safely compute untrusted bytecode"
-	depends on PROC_FS
-	default y
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via /proc/<pid>/seccomp, it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
-
-	  If unsure, say Y. Only embedded should say N here.
-
 config PPC_MEM_KEYS
 	prompt "PowerPC Memory Protection Keys"
 	def_bool y
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index df18372861d8..c456b558fab9 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -333,19 +333,6 @@ menu "Kernel features"
 
 source "kernel/Kconfig.hz"
 
-config SECCOMP
-	bool "Enable seccomp to safely compute untrusted bytecode"
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via prctl(PR_SET_SECCOMP), it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
-
 config RISCV_SBI_V01
 	bool "SBI v0.1 support"
 	default y
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 3d86e12e8e3c..7f7b40ec699e 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -791,23 +791,6 @@ config CRASH_DUMP
 
 endmenu
 
-config SECCOMP
-	def_bool y
-	prompt "Enable seccomp to safely compute untrusted bytecode"
-	depends on PROC_FS
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via /proc/<pid>/seccomp, it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
-
-	  If unsure, say Y.
-
 config CCW
 	def_bool y
diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig
index d20927128fce..18278152c91c 100644
--- a/arch/sh/Kconfig
+++ b/arch/sh/Kconfig
@@ -600,22 +600,6 @@ config PHYSICAL_START
 	  where the fail safe kernel needs to run at a different address
 	  than the panic-ed kernel.
 
-config SECCOMP
-	bool "Enable seccomp to safely compute untrusted bytecode"
-	depends on PROC_FS
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via prctl, it cannot be disabled and the task is only
-	  allowed to execute a few safe syscalls defined by each seccomp
-	  mode.
-
-	  If unsure, say N.
-
 config SMP
 	bool "Symmetric multi-processing support"
 	depends on SYS_SUPPORTS_SMP
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index efeff2c896a5..d62ce83cf009 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -23,6 +23,7 @@ config SPARC
 	select HAVE_OPROFILE
 	select HAVE_ARCH_KGDB if !SMP || SPARC64
 	select HAVE_ARCH_TRACEHOOK
+	select HAVE_ARCH_SECCOMP if SPARC64
 	select HAVE_EXIT_THREAD
 	select HAVE_PCI
 	select SYSCTL_EXCEPTION_TRACE
@@ -226,23 +227,6 @@ config EARLYFB
 	help
 	  Say Y here to enable a faster early framebuffer boot console.
 
-config SECCOMP
-	bool "Enable seccomp to safely compute untrusted bytecode"
-	depends on SPARC64 && PROC_FS
-	default y
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via /proc/<pid>/seccomp, it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
-
-	  If unsure, say Y. Only embedded should say N here.
-
 config HOTPLUG_CPU
 	bool "Support for hot-pluggable CPUs"
 	depends on SPARC64 && SMP
diff --git a/arch/um/Kconfig b/arch/um/Kconfig
index eb51fec75948..d49f471b02e3 100644
--- a/arch/um/Kconfig
+++ b/arch/um/Kconfig
@@ -173,22 +173,6 @@ config PGTABLE_LEVELS
 	default 3 if 3_LEVEL_PGTABLES
 	default 2
 
-config SECCOMP
-	def_bool y
-	prompt "Enable seccomp to safely compute untrusted bytecode"
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via prctl(PR_SET_SECCOMP), it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
-
-	  If unsure, say Y.
-
 config UML_TIME_TRAVEL_SUPPORT
 	bool
 	prompt "Support time-travel mode (e.g. for test execution)"
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7101ac64bb20..1ab22869a765 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1968,22 +1968,6 @@ config EFI_MIXED
 	  If unsure, say N.
 
-config SECCOMP
-	def_bool y
-	prompt "Enable seccomp to safely compute untrusted bytecode"
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via prctl(PR_SET_SECCOMP), it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
-
-	  If unsure, say Y. Only embedded should say N here.
-
 source "kernel/Kconfig.hz"
 
 config KEXEC
diff --git a/arch/xtensa/Kconfig b/arch/xtensa/Kconfig
index e997e0119c02..d8a29dc5a284 100644
--- a/arch/xtensa/Kconfig
+++ b/arch/xtensa/Kconfig
@@ -217,20 +217,6 @@ config HOTPLUG_CPU
 	  Say N if you want to disable CPU hotplug.
 
-config SECCOMP
-	bool
-	prompt "Enable seccomp to safely compute untrusted bytecode"
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via prctl(PR_SET_SECCOMP), it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
-
 config FAST_SYSCALL_XTENSA
 	bool "Enable fast atomic syscalls"
 	default n
-- 
2.28.0
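With the SECCOMP option centralized in arch/Kconfig, an architecture opts in with a single select line instead of carrying its own copy of the option. A hypothetical new architecture's entry might look like the following (illustrative fragment only; NEWARCH is not a real config symbol from this patch):

```
config NEWARCH
	bool "Example architecture"
	# One-line opt-in; the SECCOMP prompt itself now lives in
	# arch/Kconfig and defaults to y.
	select HAVE_ARCH_SECCOMP
	# Selecting filter support also implies HAVE_ARCH_SECCOMP via
	# the new "select" under HAVE_ARCH_SECCOMP_FILTER, so listing
	# both is redundant but harmless.
	select HAVE_ARCH_SECCOMP_FILTER
```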
* [PATCH seccomp 1/6] seccomp: Move config option SECCOMP to arch/Kconfig 2020-09-24 12:06 ` [PATCH seccomp 1/6] seccomp: Move config option SECCOMP to arch/Kconfig YiFei Zhu @ 2020-09-24 12:06 ` YiFei Zhu 0 siblings, 0 replies; 149+ messages in thread From: YiFei Zhu @ 2020-09-24 12:06 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: YiFei Zhu <yifeifz2@illinois.edu> In order to make adding configurable features into seccomp easier, it's better to have the options at one single location, considering easpecially that the bulk of seccomp code is arch-independent. An quick look also show that many SECCOMP descriptions are outdated; they talk about /proc rather than prctl. As a result of moving the config option and keeping it default on, architectures arm, arm64, csky, riscv, sh, and xtensa did not have SECCOMP on by default prior to this and SECCOMP will be default in this change. Architectures microblaze, mips, powerpc, s390, sh, and sparc have an outdated depend on PROC_FS and this dependency is removed in this change. 
Suggested-by: Jann Horn <jannh@google.com> Link: https://lore.kernel.org/lkml/CAG48ez1YWz9cnp08UZgeieYRhHdqh-ch7aNwc4JRBnGyrmgfMg@mail.gmail.com/ Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> --- arch/Kconfig | 21 +++++++++++++++++++++ arch/arm/Kconfig | 15 +-------------- arch/arm64/Kconfig | 13 ------------- arch/csky/Kconfig | 13 ------------- arch/microblaze/Kconfig | 18 +----------------- arch/mips/Kconfig | 17 ----------------- arch/parisc/Kconfig | 16 ---------------- arch/powerpc/Kconfig | 17 ----------------- arch/riscv/Kconfig | 13 ------------- arch/s390/Kconfig | 17 ----------------- arch/sh/Kconfig | 16 ---------------- arch/sparc/Kconfig | 18 +----------------- arch/um/Kconfig | 16 ---------------- arch/x86/Kconfig | 16 ---------------- arch/xtensa/Kconfig | 14 -------------- 15 files changed, 24 insertions(+), 216 deletions(-) diff --git a/arch/Kconfig b/arch/Kconfig index af14a567b493..6dfc5673215d 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -444,8 +444,12 @@ config ARCH_WANT_OLD_COMPAT_IPC select ARCH_WANT_COMPAT_IPC_PARSE_VERSION bool +config HAVE_ARCH_SECCOMP + bool + config HAVE_ARCH_SECCOMP_FILTER bool + select HAVE_ARCH_SECCOMP help An arch should select this symbol if it provides all of these things: - syscall_get_arch() @@ -458,6 +462,23 @@ config HAVE_ARCH_SECCOMP_FILTER results in the system call being skipped immediately. - seccomp syscall wired up +config SECCOMP + def_bool y + depends on HAVE_ARCH_SECCOMP + prompt "Enable seccomp to safely compute untrusted bytecode" + help + This kernel feature is useful for number crunching applications + that may need to compute untrusted bytecode during their + execution. By using pipes or other transports made available to + the process as file descriptors supporting the read/write + syscalls, it's possible to isolate those applications in + their own address space using seccomp. 
Once seccomp is + enabled via prctl(PR_SET_SECCOMP), it cannot be disabled + and the task is only allowed to execute a few safe syscalls + defined by each seccomp mode. + + If unsure, say Y. Only embedded should say N here. + config SECCOMP_FILTER def_bool y depends on HAVE_ARCH_SECCOMP_FILTER && SECCOMP && NET diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index e00d94b16658..e26c19a16284 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -67,6 +67,7 @@ config ARM select HAVE_ARCH_JUMP_LABEL if !XIP_KERNEL && !CPU_ENDIAN_BE32 && MMU select HAVE_ARCH_KGDB if !CPU_ENDIAN_BE32 && MMU select HAVE_ARCH_MMAP_RND_BITS if MMU + select HAVE_ARCH_SECCOMP select HAVE_ARCH_SECCOMP_FILTER if AEABI && !OABI_COMPAT select HAVE_ARCH_THREAD_STRUCT_WHITELIST select HAVE_ARCH_TRACEHOOK @@ -1617,20 +1618,6 @@ config UACCESS_WITH_MEMCPY However, if the CPU data cache is using a write-allocate mode, this option is unlikely to provide any performance gain. -config SECCOMP - bool - prompt "Enable seccomp to safely compute untrusted bytecode" - help - This kernel feature is useful for number crunching applications - that may need to compute untrusted bytecode during their - execution. By using pipes or other transports made available to - the process as file descriptors supporting the read/write - syscalls, it's possible to isolate those applications in - their own address space using seccomp. Once seccomp is - enabled via prctl(PR_SET_SECCOMP), it cannot be disabled - and the task is only allowed to execute a few safe syscalls - defined by each seccomp mode. 
- config PARAVIRT bool "Enable paravirtualization code" help diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 6d232837cbee..98c4e34cbec1 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -1033,19 +1033,6 @@ config ARCH_ENABLE_SPLIT_PMD_PTLOCK config CC_HAVE_SHADOW_CALL_STACK def_bool $(cc-option, -fsanitize=shadow-call-stack -ffixed-x18) -config SECCOMP - bool "Enable seccomp to safely compute untrusted bytecode" - help - This kernel feature is useful for number crunching applications - that may need to compute untrusted bytecode during their - execution. By using pipes or other transports made available to - the process as file descriptors supporting the read/write - syscalls, it's possible to isolate those applications in - their own address space using seccomp. Once seccomp is - enabled via prctl(PR_SET_SECCOMP), it cannot be disabled - and the task is only allowed to execute a few safe syscalls - defined by each seccomp mode. - config PARAVIRT bool "Enable paravirtualization code" help diff --git a/arch/csky/Kconfig b/arch/csky/Kconfig index 3d5afb5f5685..7f424c85772c 100644 --- a/arch/csky/Kconfig +++ b/arch/csky/Kconfig @@ -309,16 +309,3 @@ endmenu source "arch/csky/Kconfig.platforms" source "kernel/Kconfig.hz" - -config SECCOMP - bool "Enable seccomp to safely compute untrusted bytecode" - help - This kernel feature is useful for number crunching applications - that may need to compute untrusted bytecode during their - execution. By using pipes or other transports made available to - the process as file descriptors supporting the read/write - syscalls, it's possible to isolate those applications in - their own address space using seccomp. Once seccomp is - enabled via prctl(PR_SET_SECCOMP), it cannot be disabled - and the task is only allowed to execute a few safe syscalls - defined by each seccomp mode. 
diff --git a/arch/microblaze/Kconfig b/arch/microblaze/Kconfig index d262ac0c8714..37bd6a5f38fb 100644 --- a/arch/microblaze/Kconfig +++ b/arch/microblaze/Kconfig @@ -26,6 +26,7 @@ config MICROBLAZE select GENERIC_SCHED_CLOCK select HAVE_ARCH_HASH select HAVE_ARCH_KGDB + select HAVE_ARCH_SECCOMP select HAVE_DEBUG_KMEMLEAK select HAVE_DMA_CONTIGUOUS select HAVE_DYNAMIC_FTRACE @@ -120,23 +121,6 @@ config CMDLINE_FORCE Set this to have arguments from the default kernel command string override those passed by the boot loader. -config SECCOMP - bool "Enable seccomp to safely compute untrusted bytecode" - depends on PROC_FS - default y - help - This kernel feature is useful for number crunching applications - that may need to compute untrusted bytecode during their - execution. By using pipes or other transports made available to - the process as file descriptors supporting the read/write - syscalls, it's possible to isolate those applications in - their own address space using seccomp. Once seccomp is - enabled via /proc/<pid>/seccomp, it cannot be disabled - and the task is only allowed to execute a few safe syscalls - defined by each seccomp mode. - - If unsure, say Y. Only embedded should say N here. - endmenu menu "Kernel features" diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig index c95fa3a2484c..5f88a8fc11fc 100644 --- a/arch/mips/Kconfig +++ b/arch/mips/Kconfig @@ -3004,23 +3004,6 @@ config PHYSICAL_START specified in the "crashkernel=YM@XM" command line boot parameter passed to the panic-ed kernel). -config SECCOMP - bool "Enable seccomp to safely compute untrusted bytecode" - depends on PROC_FS - default y - help - This kernel feature is useful for number crunching applications - that may need to compute untrusted bytecode during their - execution. By using pipes or other transports made available to - the process as file descriptors supporting the read/write - syscalls, it's possible to isolate those applications in - their own address space using seccomp. 
Once seccomp is - enabled via /proc/<pid>/seccomp, it cannot be disabled - and the task is only allowed to execute a few safe syscalls - defined by each seccomp mode. - - If unsure, say Y. Only embedded should say N here. - config MIPS_O32_FP64_SUPPORT bool "Support for O32 binaries using 64-bit FP" if !CPU_MIPSR6 depends on 32BIT || MIPS32_O32 diff --git a/arch/parisc/Kconfig b/arch/parisc/Kconfig index 3b0f53dd70bc..cd4afe1e7a6c 100644 --- a/arch/parisc/Kconfig +++ b/arch/parisc/Kconfig @@ -378,19 +378,3 @@ endmenu source "drivers/parisc/Kconfig" - -config SECCOMP - def_bool y - prompt "Enable seccomp to safely compute untrusted bytecode" - help - This kernel feature is useful for number crunching applications - that may need to compute untrusted bytecode during their - execution. By using pipes or other transports made available to - the process as file descriptors supporting the read/write - syscalls, it's possible to isolate those applications in - their own address space using seccomp. Once seccomp is - enabled via prctl(PR_SET_SECCOMP), it cannot be disabled - and the task is only allowed to execute a few safe syscalls - defined by each seccomp mode. - - If unsure, say Y. Only embedded should say N here. diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 1f48bbfb3ce9..136fe860caef 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -934,23 +934,6 @@ config ARCH_WANTS_FREEZER_CONTROL source "kernel/power/Kconfig" -config SECCOMP - bool "Enable seccomp to safely compute untrusted bytecode" - depends on PROC_FS - default y - help - This kernel feature is useful for number crunching applications - that may need to compute untrusted bytecode during their - execution. By using pipes or other transports made available to - the process as file descriptors supporting the read/write - syscalls, it's possible to isolate those applications in - their own address space using seccomp. 
Once seccomp is - enabled via /proc/<pid>/seccomp, it cannot be disabled - and the task is only allowed to execute a few safe syscalls - defined by each seccomp mode. - - If unsure, say Y. Only embedded should say N here. - config PPC_MEM_KEYS prompt "PowerPC Memory Protection Keys" def_bool y diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig index df18372861d8..c456b558fab9 100644 --- a/arch/riscv/Kconfig +++ b/arch/riscv/Kconfig @@ -333,19 +333,6 @@ menu "Kernel features" source "kernel/Kconfig.hz" -config SECCOMP - bool "Enable seccomp to safely compute untrusted bytecode" - help - This kernel feature is useful for number crunching applications - that may need to compute untrusted bytecode during their - execution. By using pipes or other transports made available to - the process as file descriptors supporting the read/write - syscalls, it's possible to isolate those applications in - their own address space using seccomp. Once seccomp is - enabled via prctl(PR_SET_SECCOMP), it cannot be disabled - and the task is only allowed to execute a few safe syscalls - defined by each seccomp mode. - config RISCV_SBI_V01 bool "SBI v0.1 support" default y diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig index 3d86e12e8e3c..7f7b40ec699e 100644 --- a/arch/s390/Kconfig +++ b/arch/s390/Kconfig @@ -791,23 +791,6 @@ config CRASH_DUMP endmenu -config SECCOMP - def_bool y - prompt "Enable seccomp to safely compute untrusted bytecode" - depends on PROC_FS - help - This kernel feature is useful for number crunching applications - that may need to compute untrusted bytecode during their - execution. By using pipes or other transports made available to - the process as file descriptors supporting the read/write - syscalls, it's possible to isolate those applications in - their own address space using seccomp. Once seccomp is - enabled via /proc/<pid>/seccomp, it cannot be disabled - and the task is only allowed to execute a few safe syscalls - defined by each seccomp mode. 
- - If unsure, say Y. - config CCW def_bool y diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig index d20927128fce..18278152c91c 100644 --- a/arch/sh/Kconfig +++ b/arch/sh/Kconfig @@ -600,22 +600,6 @@ config PHYSICAL_START where the fail safe kernel needs to run at a different address than the panic-ed kernel. -config SECCOMP - bool "Enable seccomp to safely compute untrusted bytecode" - depends on PROC_FS - help - This kernel feature is useful for number crunching applications - that may need to compute untrusted bytecode during their - execution. By using pipes or other transports made available to - the process as file descriptors supporting the read/write - syscalls, it's possible to isolate those applications in - their own address space using seccomp. Once seccomp is - enabled via prctl, it cannot be disabled and the task is only - allowed to execute a few safe syscalls defined by each seccomp - mode. - - If unsure, say N. - config SMP bool "Symmetric multi-processing support" depends on SYS_SUPPORTS_SMP diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig index efeff2c896a5..d62ce83cf009 100644 --- a/arch/sparc/Kconfig +++ b/arch/sparc/Kconfig @@ -23,6 +23,7 @@ config SPARC select HAVE_OPROFILE select HAVE_ARCH_KGDB if !SMP || SPARC64 select HAVE_ARCH_TRACEHOOK + select HAVE_ARCH_SECCOMP if SPARC64 select HAVE_EXIT_THREAD select HAVE_PCI select SYSCTL_EXCEPTION_TRACE @@ -226,23 +227,6 @@ config EARLYFB help Say Y here to enable a faster early framebuffer boot console. -config SECCOMP - bool "Enable seccomp to safely compute untrusted bytecode" - depends on SPARC64 && PROC_FS - default y - help - This kernel feature is useful for number crunching applications - that may need to compute untrusted bytecode during their - execution. By using pipes or other transports made available to - the process as file descriptors supporting the read/write - syscalls, it's possible to isolate those applications in - their own address space using seccomp. 
Once seccomp is - enabled via /proc/<pid>/seccomp, it cannot be disabled - and the task is only allowed to execute a few safe syscalls - defined by each seccomp mode. - - If unsure, say Y. Only embedded should say N here. - config HOTPLUG_CPU bool "Support for hot-pluggable CPUs" depends on SPARC64 && SMP diff --git a/arch/um/Kconfig b/arch/um/Kconfig index eb51fec75948..d49f471b02e3 100644 --- a/arch/um/Kconfig +++ b/arch/um/Kconfig @@ -173,22 +173,6 @@ config PGTABLE_LEVELS default 3 if 3_LEVEL_PGTABLES default 2 -config SECCOMP - def_bool y - prompt "Enable seccomp to safely compute untrusted bytecode" - help - This kernel feature is useful for number crunching applications - that may need to compute untrusted bytecode during their - execution. By using pipes or other transports made available to - the process as file descriptors supporting the read/write - syscalls, it's possible to isolate those applications in - their own address space using seccomp. Once seccomp is - enabled via prctl(PR_SET_SECCOMP), it cannot be disabled - and the task is only allowed to execute a few safe syscalls - defined by each seccomp mode. - - If unsure, say Y. - config UML_TIME_TRAVEL_SUPPORT bool prompt "Support time-travel mode (e.g. for test execution)" diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 7101ac64bb20..1ab22869a765 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -1968,22 +1968,6 @@ config EFI_MIXED If unsure, say N. -config SECCOMP - def_bool y - prompt "Enable seccomp to safely compute untrusted bytecode" - help - This kernel feature is useful for number crunching applications - that may need to compute untrusted bytecode during their - execution. By using pipes or other transports made available to - the process as file descriptors supporting the read/write - syscalls, it's possible to isolate those applications in - their own address space using seccomp. 
Once seccomp is - enabled via prctl(PR_SET_SECCOMP), it cannot be disabled - and the task is only allowed to execute a few safe syscalls - defined by each seccomp mode. - - If unsure, say Y. Only embedded should say N here. - source "kernel/Kconfig.hz" config KEXEC diff --git a/arch/xtensa/Kconfig b/arch/xtensa/Kconfig index e997e0119c02..d8a29dc5a284 100644 --- a/arch/xtensa/Kconfig +++ b/arch/xtensa/Kconfig @@ -217,20 +217,6 @@ config HOTPLUG_CPU Say N if you want to disable CPU hotplug. -config SECCOMP - bool - prompt "Enable seccomp to safely compute untrusted bytecode" - help - This kernel feature is useful for number crunching applications - that may need to compute untrusted bytecode during their - execution. By using pipes or other transports made available to - the process as file descriptors supporting the read/write - syscalls, it's possible to isolate those applications in - their own address space using seccomp. Once seccomp is - enabled via prctl(PR_SET_SECCOMP), it cannot be disabled - and the task is only allowed to execute a few safe syscalls - defined by each seccomp mode. - config FAST_SYSCALL_XTENSA bool "Enable fast atomic syscalls" default n -- 2.28.0 ^ permalink raw reply related [flat|nested] 149+ messages in thread
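The per-arch SECCOMP blocks deleted above get replaced by a single shared definition, with each architecture only selecting HAVE_ARCH_SECCOMP (as the sparc hunk's `select HAVE_ARCH_SECCOMP if SPARC64` shows). A hedged sketch of the consolidated shape in arch/Kconfig, with the help text abbreviated rather than quoted from the patch:

```kconfig
# Sketch only: the consolidated option, replacing the per-arch copies.
config HAVE_ARCH_SECCOMP
	bool

config SECCOMP
	prompt "Enable seccomp to safely execute untrusted bytecode"
	def_bool y
	depends on HAVE_ARCH_SECCOMP
```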
* [PATCH seccomp 2/6] asm/syscall.h: Add syscall_arches[] array 2020-09-24 12:06 ` [PATCH seccomp 0/6] " YiFei Zhu 2020-09-24 12:06 ` [PATCH seccomp 1/6] seccomp: Move config option SECCOMP to arch/Kconfig YiFei Zhu @ 2020-09-24 12:06 ` YiFei Zhu 2020-09-24 12:06 ` [PATCH seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent YiFei Zhu ` (4 subsequent siblings) 6 siblings, 0 replies; 149+ messages in thread From: YiFei Zhu @ 2020-09-24 12:06 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: YiFei Zhu <yifeifz2@illinois.edu> The seccomp cache emulator needs to know all the architecture numbers that syscall_get_arch() could return for the kernel build, in order to generate a cache for each of them. The array is declared in the header as static __maybe_unused const to maximize compiler optimization opportunities such as loop unrolling. 
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> --- arch/alpha/include/asm/syscall.h | 4 ++++ arch/arc/include/asm/syscall.h | 24 +++++++++++++++++++----- arch/arm/include/asm/syscall.h | 4 ++++ arch/arm64/include/asm/syscall.h | 4 ++++ arch/c6x/include/asm/syscall.h | 13 +++++++++++-- arch/csky/include/asm/syscall.h | 4 ++++ arch/h8300/include/asm/syscall.h | 4 ++++ arch/hexagon/include/asm/syscall.h | 4 ++++ arch/ia64/include/asm/syscall.h | 4 ++++ arch/m68k/include/asm/syscall.h | 4 ++++ arch/microblaze/include/asm/syscall.h | 4 ++++ arch/mips/include/asm/syscall.h | 16 ++++++++++++++++ arch/nds32/include/asm/syscall.h | 13 +++++++++++-- arch/nios2/include/asm/syscall.h | 4 ++++ arch/openrisc/include/asm/syscall.h | 4 ++++ arch/parisc/include/asm/syscall.h | 7 +++++++ arch/powerpc/include/asm/syscall.h | 14 ++++++++++++++ arch/riscv/include/asm/syscall.h | 14 ++++++++++---- arch/s390/include/asm/syscall.h | 7 +++++++ arch/sh/include/asm/syscall_32.h | 17 +++++++++++------ arch/sparc/include/asm/syscall.h | 9 +++++++++ arch/x86/include/asm/syscall.h | 11 +++++++++++ arch/x86/um/asm/syscall.h | 14 ++++++++++---- arch/xtensa/include/asm/syscall.h | 4 ++++ 24 files changed, 184 insertions(+), 23 deletions(-) diff --git a/arch/alpha/include/asm/syscall.h b/arch/alpha/include/asm/syscall.h index 11c688c1d7ec..625ac9b23f37 100644 --- a/arch/alpha/include/asm/syscall.h +++ b/arch/alpha/include/asm/syscall.h @@ -4,6 +4,10 @@ #include <uapi/linux/audit.h> +static __maybe_unused const int syscall_arches[] = { + AUDIT_ARCH_ALPHA +}; + static inline int syscall_get_arch(struct task_struct *task) { return AUDIT_ARCH_ALPHA; diff --git a/arch/arc/include/asm/syscall.h b/arch/arc/include/asm/syscall.h index 94529e89dff0..899c13cbf5cc 100644 --- a/arch/arc/include/asm/syscall.h +++ b/arch/arc/include/asm/syscall.h @@ -65,14 +65,28 @@ syscall_get_arguments(struct task_struct *task, struct pt_regs *regs, } } +#ifdef CONFIG_ISA_ARCOMPACT +# ifdef CONFIG_CPU_BIG_ENDIAN +# define 
SYSCALL_ARCH AUDIT_ARCH_ARCOMPACTBE +# else +# define SYSCALL_ARCH AUDIT_ARCH_ARCOMPACT +# endif /* CONFIG_CPU_BIG_ENDIAN */ +#else +# ifdef CONFIG_CPU_BIG_ENDIAN +# define SYSCALL_ARCH AUDIT_ARCH_ARCV2BE +# else +# define SYSCALL_ARCH AUDIT_ARCH_ARCV2 +# endif /* CONFIG_CPU_BIG_ENDIAN */ +#endif /* CONFIG_ISA_ARCOMPACT */ + +static __maybe_unused const int syscall_arches[] = { + SYSCALL_ARCH +}; + static inline int syscall_get_arch(struct task_struct *task) { - return IS_ENABLED(CONFIG_ISA_ARCOMPACT) - ? (IS_ENABLED(CONFIG_CPU_BIG_ENDIAN) - ? AUDIT_ARCH_ARCOMPACTBE : AUDIT_ARCH_ARCOMPACT) - : (IS_ENABLED(CONFIG_CPU_BIG_ENDIAN) - ? AUDIT_ARCH_ARCV2BE : AUDIT_ARCH_ARCV2); + return SYSCALL_ARCH; } #endif diff --git a/arch/arm/include/asm/syscall.h b/arch/arm/include/asm/syscall.h index fd02761ba06c..33ade26e3956 100644 --- a/arch/arm/include/asm/syscall.h +++ b/arch/arm/include/asm/syscall.h @@ -73,6 +73,10 @@ static inline void syscall_set_arguments(struct task_struct *task, memcpy(®s->ARM_r0 + 1, args, 5 * sizeof(args[0])); } +static __maybe_unused const int syscall_arches[] = { + AUDIT_ARCH_ARM +}; + static inline int syscall_get_arch(struct task_struct *task) { /* ARM tasks don't change audit architectures on the fly. */ diff --git a/arch/arm64/include/asm/syscall.h b/arch/arm64/include/asm/syscall.h index cfc0672013f6..77f3d300e7a0 100644 --- a/arch/arm64/include/asm/syscall.h +++ b/arch/arm64/include/asm/syscall.h @@ -82,6 +82,10 @@ static inline void syscall_set_arguments(struct task_struct *task, memcpy(®s->regs[1], args, 5 * sizeof(args[0])); } +static __maybe_unused const int syscall_arches[] = { + AUDIT_ARCH_ARM, AUDIT_ARCH_AARCH64 +}; + /* * We don't care about endianness (__AUDIT_ARCH_LE bit) here because * AArch64 has the same system calls both on little- and big- endian. 
diff --git a/arch/c6x/include/asm/syscall.h b/arch/c6x/include/asm/syscall.h index 38f3e2284ecd..0d78c67ee1fc 100644 --- a/arch/c6x/include/asm/syscall.h +++ b/arch/c6x/include/asm/syscall.h @@ -66,10 +66,19 @@ static inline void syscall_set_arguments(struct task_struct *task, regs->a9 = *args; } +#ifdef CONFIG_CPU_BIG_ENDIAN +#define SYSCALL_ARCH AUDIT_ARCH_C6XBE +#else +#define SYSCALL_ARCH AUDIT_ARCH_C6X +#endif + +static __maybe_unused const int syscall_arches[] = { + SYSCALL_ARCH +}; + static inline int syscall_get_arch(struct task_struct *task) { - return IS_ENABLED(CONFIG_CPU_BIG_ENDIAN) - ? AUDIT_ARCH_C6XBE : AUDIT_ARCH_C6X; + return SYSCALL_ARCH; } #endif /* __ASM_C6X_SYSCALLS_H */ diff --git a/arch/csky/include/asm/syscall.h b/arch/csky/include/asm/syscall.h index f624fa3bbc22..86242d2850d7 100644 --- a/arch/csky/include/asm/syscall.h +++ b/arch/csky/include/asm/syscall.h @@ -68,6 +68,10 @@ syscall_set_arguments(struct task_struct *task, struct pt_regs *regs, memcpy(®s->a1, args, 5 * sizeof(regs->a1)); } +static __maybe_unused const int syscall_arches[] = { + AUDIT_ARCH_CSKY +}; + static inline int syscall_get_arch(struct task_struct *task) { diff --git a/arch/h8300/include/asm/syscall.h b/arch/h8300/include/asm/syscall.h index 01666b8bb263..775f6ac8fde3 100644 --- a/arch/h8300/include/asm/syscall.h +++ b/arch/h8300/include/asm/syscall.h @@ -28,6 +28,10 @@ syscall_get_arguments(struct task_struct *task, struct pt_regs *regs, *args = regs->er6; } +static __maybe_unused const int syscall_arches[] = { + AUDIT_ARCH_H8300 +}; + static inline int syscall_get_arch(struct task_struct *task) { diff --git a/arch/hexagon/include/asm/syscall.h b/arch/hexagon/include/asm/syscall.h index f6e454f18038..6ee21a76f6a3 100644 --- a/arch/hexagon/include/asm/syscall.h +++ b/arch/hexagon/include/asm/syscall.h @@ -45,6 +45,10 @@ static inline long syscall_get_return_value(struct task_struct *task, return regs->r00; } +static __maybe_unused const int syscall_arches[] = { + 
AUDIT_ARCH_HEXAGON +}; + static inline int syscall_get_arch(struct task_struct *task) { return AUDIT_ARCH_HEXAGON; diff --git a/arch/ia64/include/asm/syscall.h b/arch/ia64/include/asm/syscall.h index 6c6f16e409a8..19456125c89a 100644 --- a/arch/ia64/include/asm/syscall.h +++ b/arch/ia64/include/asm/syscall.h @@ -71,6 +71,10 @@ static inline void syscall_set_arguments(struct task_struct *task, ia64_syscall_get_set_arguments(task, regs, args, 1); } +static __maybe_unused const int syscall_arches[] = { + AUDIT_ARCH_IA64 +}; + static inline int syscall_get_arch(struct task_struct *task) { return AUDIT_ARCH_IA64; diff --git a/arch/m68k/include/asm/syscall.h b/arch/m68k/include/asm/syscall.h index 465ac039be09..031b051f9026 100644 --- a/arch/m68k/include/asm/syscall.h +++ b/arch/m68k/include/asm/syscall.h @@ -4,6 +4,10 @@ #include <uapi/linux/audit.h> +static __maybe_unused const int syscall_arches[] = { + AUDIT_ARCH_M68K +}; + static inline int syscall_get_arch(struct task_struct *task) { return AUDIT_ARCH_M68K; diff --git a/arch/microblaze/include/asm/syscall.h b/arch/microblaze/include/asm/syscall.h index 3a6924f3cbde..28cde14056d1 100644 --- a/arch/microblaze/include/asm/syscall.h +++ b/arch/microblaze/include/asm/syscall.h @@ -105,6 +105,10 @@ static inline void syscall_set_arguments(struct task_struct *task, asmlinkage unsigned long do_syscall_trace_enter(struct pt_regs *regs); asmlinkage void do_syscall_trace_leave(struct pt_regs *regs); +static __maybe_unused const int syscall_arches[] = { + AUDIT_ARCH_MICROBLAZE +}; + static inline int syscall_get_arch(struct task_struct *task) { return AUDIT_ARCH_MICROBLAZE; diff --git a/arch/mips/include/asm/syscall.h b/arch/mips/include/asm/syscall.h index 25fa651c937d..29e4c1c47c54 100644 --- a/arch/mips/include/asm/syscall.h +++ b/arch/mips/include/asm/syscall.h @@ -140,6 +140,22 @@ extern const unsigned long sys_call_table[]; extern const unsigned long sys32_call_table[]; extern const unsigned long sysn32_call_table[]; 
+static __maybe_unused const int syscall_arches[] = { +#ifdef __LITTLE_ENDIAN + AUDIT_ARCH_MIPSEL, +# ifdef CONFIG_64BIT + AUDIT_ARCH_MIPSEL64, + AUDIT_ARCH_MIPSEL64N32, +# endif /* CONFIG_64BIT */ +#else + AUDIT_ARCH_MIPS, +# ifdef CONFIG_64BIT + AUDIT_ARCH_MIPS64, + AUDIT_ARCH_MIPS64N32, +# endif /* CONFIG_64BIT */ +#endif /* __LITTLE_ENDIAN */ +}; + static inline int syscall_get_arch(struct task_struct *task) { int arch = AUDIT_ARCH_MIPS; diff --git a/arch/nds32/include/asm/syscall.h b/arch/nds32/include/asm/syscall.h index 7b5180d78e20..2dd5e33bcfcb 100644 --- a/arch/nds32/include/asm/syscall.h +++ b/arch/nds32/include/asm/syscall.h @@ -154,11 +154,20 @@ syscall_set_arguments(struct task_struct *task, struct pt_regs *regs, memcpy(®s->uregs[0] + 1, args, 5 * sizeof(args[0])); } +#ifdef CONFIG_CPU_BIG_ENDIAN +#define SYSCALL_ARCH AUDIT_ARCH_NDS32BE +#else +#define SYSCALL_ARCH AUDIT_ARCH_NDS32 +#endif + +static __maybe_unused const int syscall_arches[] = { + SYSCALL_ARCH +}; + static inline int syscall_get_arch(struct task_struct *task) { - return IS_ENABLED(CONFIG_CPU_BIG_ENDIAN) - ? 
AUDIT_ARCH_NDS32BE : AUDIT_ARCH_NDS32; + return SYSCALL_ARCH; } #endif /* _ASM_NDS32_SYSCALL_H */ diff --git a/arch/nios2/include/asm/syscall.h b/arch/nios2/include/asm/syscall.h index 526449edd768..8fa2716cac5a 100644 --- a/arch/nios2/include/asm/syscall.h +++ b/arch/nios2/include/asm/syscall.h @@ -69,6 +69,10 @@ static inline void syscall_set_arguments(struct task_struct *task, regs->r9 = *args; } +static __maybe_unused const int syscall_arches[] = { + AUDIT_ARCH_NIOS2 +}; + static inline int syscall_get_arch(struct task_struct *task) { return AUDIT_ARCH_NIOS2; diff --git a/arch/openrisc/include/asm/syscall.h b/arch/openrisc/include/asm/syscall.h index e6383be2a195..4eb28ad08042 100644 --- a/arch/openrisc/include/asm/syscall.h +++ b/arch/openrisc/include/asm/syscall.h @@ -64,6 +64,10 @@ syscall_set_arguments(struct task_struct *task, struct pt_regs *regs, memcpy(®s->gpr[3], args, 6 * sizeof(args[0])); } +static __maybe_unused const int syscall_arches[] = { + AUDIT_ARCH_OPENRISC +}; + static inline int syscall_get_arch(struct task_struct *task) { return AUDIT_ARCH_OPENRISC; diff --git a/arch/parisc/include/asm/syscall.h b/arch/parisc/include/asm/syscall.h index 00b127a5e09b..2915f140c9fd 100644 --- a/arch/parisc/include/asm/syscall.h +++ b/arch/parisc/include/asm/syscall.h @@ -55,6 +55,13 @@ static inline void syscall_rollback(struct task_struct *task, /* do nothing */ } +static __maybe_unused const int syscall_arches[] = { + AUDIT_ARCH_PARISC, +#ifdef CONFIG_64BIT + AUDIT_ARCH_PARISC64, +#endif +}; + static inline int syscall_get_arch(struct task_struct *task) { int arch = AUDIT_ARCH_PARISC; diff --git a/arch/powerpc/include/asm/syscall.h b/arch/powerpc/include/asm/syscall.h index fd1b518eed17..781deb211e3d 100644 --- a/arch/powerpc/include/asm/syscall.h +++ b/arch/powerpc/include/asm/syscall.h @@ -104,6 +104,20 @@ static inline void syscall_set_arguments(struct task_struct *task, regs->orig_gpr3 = args[0]; } +static __maybe_unused const int syscall_arches[] = { 
+#ifdef __LITTLE_ENDIAN__ + AUDIT_ARCH_PPC | __AUDIT_ARCH_LE, +# ifdef CONFIG_PPC64 + AUDIT_ARCH_PPC64LE, +# endif /* CONFIG_PPC64 */ +#else + AUDIT_ARCH_PPC, +# ifdef CONFIG_PPC64 + AUDIT_ARCH_PPC64, +# endif /* CONFIG_PPC64 */ +#endif /* __LITTLE_ENDIAN__ */ +}; + static inline int syscall_get_arch(struct task_struct *task) { int arch; diff --git a/arch/riscv/include/asm/syscall.h b/arch/riscv/include/asm/syscall.h index 49350c8bd7b0..4b36d358243e 100644 --- a/arch/riscv/include/asm/syscall.h +++ b/arch/riscv/include/asm/syscall.h @@ -73,13 +73,19 @@ static inline void syscall_set_arguments(struct task_struct *task, memcpy(®s->a1, args, 5 * sizeof(regs->a1)); } -static inline int syscall_get_arch(struct task_struct *task) -{ #ifdef CONFIG_64BIT - return AUDIT_ARCH_RISCV64; +#define SYSCALL_ARCH AUDIT_ARCH_RISCV64 #else - return AUDIT_ARCH_RISCV32; +#define SYSCALL_ARCH AUDIT_ARCH_RISCV32 #endif + +static __maybe_unused const int syscall_arches[] = { + SYSCALL_ARCH +}; + +static inline int syscall_get_arch(struct task_struct *task) +{ + return SYSCALL_ARCH; } #endif /* _ASM_RISCV_SYSCALL_H */ diff --git a/arch/s390/include/asm/syscall.h b/arch/s390/include/asm/syscall.h index d9d5de0f67ff..4cb9da36610a 100644 --- a/arch/s390/include/asm/syscall.h +++ b/arch/s390/include/asm/syscall.h @@ -89,6 +89,13 @@ static inline void syscall_set_arguments(struct task_struct *task, regs->orig_gpr2 = args[0]; } +static __maybe_unused const int syscall_arches[] = { + AUDIT_ARCH_S390X, +#ifdef CONFIG_COMPAT + AUDIT_ARCH_S390, +#endif +}; + static inline int syscall_get_arch(struct task_struct *task) { #ifdef CONFIG_COMPAT diff --git a/arch/sh/include/asm/syscall_32.h b/arch/sh/include/asm/syscall_32.h index cb51a7528384..4780f2339c72 100644 --- a/arch/sh/include/asm/syscall_32.h +++ b/arch/sh/include/asm/syscall_32.h @@ -69,13 +69,18 @@ static inline void syscall_set_arguments(struct task_struct *task, regs->regs[4] = args[0]; } -static inline int syscall_get_arch(struct 
task_struct *task) -{ - int arch = AUDIT_ARCH_SH; - #ifdef CONFIG_CPU_LITTLE_ENDIAN - arch |= __AUDIT_ARCH_LE; +#define SYSCALL_ARCH AUDIT_ARCH_SHEL +#else +#define SYSCALL_ARCH AUDIT_ARCH_SH #endif - return arch; + +static __maybe_unused const int syscall_arches[] = { + SYSCALL_ARCH +}; + +static inline int syscall_get_arch(struct task_struct *task) +{ + return SYSCALL_ARCH; } #endif /* __ASM_SH_SYSCALL_32_H */ diff --git a/arch/sparc/include/asm/syscall.h b/arch/sparc/include/asm/syscall.h index 62a5a78804c4..a458992cdcfe 100644 --- a/arch/sparc/include/asm/syscall.h +++ b/arch/sparc/include/asm/syscall.h @@ -127,6 +127,15 @@ static inline void syscall_set_arguments(struct task_struct *task, regs->u_regs[UREG_I0 + i] = args[i]; } +static __maybe_unused const int syscall_arches[] = { +#ifdef CONFIG_SPARC64 + AUDIT_ARCH_SPARC64, +#endif +#if !defined(CONFIG_SPARC64) || defined(CONFIG_COMPAT) + AUDIT_ARCH_SPARC, +#endif +}; + static inline int syscall_get_arch(struct task_struct *task) { #if defined(CONFIG_SPARC64) && defined(CONFIG_COMPAT) diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h index 7cbf733d11af..e13bb2a65b6f 100644 --- a/arch/x86/include/asm/syscall.h +++ b/arch/x86/include/asm/syscall.h @@ -97,6 +97,10 @@ static inline void syscall_set_arguments(struct task_struct *task, memcpy(®s->bx + i, args, n * sizeof(args[0])); } +static __maybe_unused const int syscall_arches[] = { + AUDIT_ARCH_I386 +}; + static inline int syscall_get_arch(struct task_struct *task) { return AUDIT_ARCH_I386; @@ -152,6 +156,13 @@ static inline void syscall_set_arguments(struct task_struct *task, } } +static __maybe_unused const int syscall_arches[] = { + AUDIT_ARCH_X86_64, +#ifdef CONFIG_IA32_EMULATION + AUDIT_ARCH_I386, +#endif +}; + static inline int syscall_get_arch(struct task_struct *task) { /* x32 tasks should be considered AUDIT_ARCH_X86_64. 
*/ diff --git a/arch/x86/um/asm/syscall.h b/arch/x86/um/asm/syscall.h index 56a2f0913e3c..590a31e22b99 100644 --- a/arch/x86/um/asm/syscall.h +++ b/arch/x86/um/asm/syscall.h @@ -9,13 +9,19 @@ typedef asmlinkage long (*sys_call_ptr_t)(unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long); -static inline int syscall_get_arch(struct task_struct *task) -{ #ifdef CONFIG_X86_32 - return AUDIT_ARCH_I386; +#define SYSCALL_ARCH AUDIT_ARCH_I386 #else - return AUDIT_ARCH_X86_64; +#define SYSCALL_ARCH AUDIT_ARCH_X86_64 #endif + +static __maybe_unused const int syscall_arches[] = { + SYSCALL_ARCH +}; + +static inline int syscall_get_arch(struct task_struct *task) +{ + return SYSCALL_ARCH; } #endif /* __UM_ASM_SYSCALL_H */ diff --git a/arch/xtensa/include/asm/syscall.h b/arch/xtensa/include/asm/syscall.h index f9a671cbf933..3d334fb0d329 100644 --- a/arch/xtensa/include/asm/syscall.h +++ b/arch/xtensa/include/asm/syscall.h @@ -14,6 +14,10 @@ #include <asm/ptrace.h> #include <uapi/linux/audit.h> +static __maybe_unused const int syscall_arches[] = { + AUDIT_ARCH_XTENSA +}; + static inline int syscall_get_arch(struct task_struct *task) { return AUDIT_ARCH_XTENSA; -- 2.28.0 ^ permalink raw reply related [flat|nested] 149+ messages in thread
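The pattern each architecture header adopts above can be sketched in self-contained userspace C. The DEMO_* names and values below are illustrative stand-ins (not the kernel's AUDIT_ARCH_* definitions); the invariant being demonstrated is that every value syscall_get_arch() may return must also appear in syscall_arches[]:

```c
#include <assert.h>

/* Stand-ins for AUDIT_ARCH_* values; the real constants live in
 * uapi/linux/audit.h. */
#define DEMO_ARCH_LE 0x40000028u	/* little-endian variant */
#define DEMO_ARCH_BE 0x00000028u	/* big-endian variant */

/* Compile-time selection, mirroring the CONFIG_CPU_BIG_ENDIAN #ifdefs
 * the patch adds to arc, c6x, nds32, sh, and others. */
#ifdef DEMO_BIG_ENDIAN
# define DEMO_SYSCALL_ARCH DEMO_ARCH_BE
#else
# define DEMO_SYSCALL_ARCH DEMO_ARCH_LE
#endif

/* The seccomp cache walks this array; because it is a small file-scope
 * constant, the compiler can unroll loops over it. */
static const unsigned int demo_syscall_arches[] = {
	DEMO_SYSCALL_ARCH,
};

static unsigned int demo_syscall_get_arch(void)
{
	return DEMO_SYSCALL_ARCH;
}
```

For multi-ABI builds (MIPS, PowerPC, s390, sparc, x86), the array simply grows extra entries under the relevant CONFIG_* guards, as the hunks above show.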
* [PATCH seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent 2020-09-24 12:06 ` [PATCH seccomp 0/6] " YiFei Zhu 2020-09-24 12:06 ` [PATCH seccomp 1/6] seccomp: Move config option SECCOMP to arch/Kconfig YiFei Zhu 2020-09-24 12:06 ` [PATCH seccomp 2/6] asm/syscall.h: Add syscall_arches[] array YiFei Zhu @ 2020-09-24 12:06 ` YiFei Zhu 2020-09-24 12:06 ` [PATCH seccomp 4/6] seccomp/cache: Lookup syscall allowlist for fast path YiFei Zhu ` (3 subsequent siblings) 6 siblings, 0 replies; 149+ messages in thread From: YiFei Zhu @ 2020-09-24 12:06 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: YiFei Zhu <yifeifz2@illinois.edu> SECCOMP_CACHE_NR_ONLY will only operate on syscalls that do not access any syscall arguments or the instruction pointer. To facilitate this, we need a static analyser that can tell whether a filter will return allow regardless of syscall arguments for a given architecture number / syscall number pair. This is implemented here with a pseudo-emulator, and the result is stored in a per-filter bitmap. Each common BPF instruction (stolen from Kees's list [1]) is emulated. Any weirdness, or any load from a syscall argument, causes the emulator to bail. The emulation is also halted if it reaches a return; in that case, if it returns SECCOMP_RET_ALLOW, the syscall is marked as good. Filter dependency is resolved at attach time. If a filter depends on other filters, we AND its bitmap with its dependee's: if the dependee does not guarantee to allow a syscall, the depender is also marked as not guaranteeing to allow it. 
[1] https://lore.kernel.org/lkml/20200923232923.3142503-5-keescook@chromium.org/ Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> --- arch/Kconfig | 25 ++++++ kernel/seccomp.c | 196 ++++++++++++++++++++++++++++++++++++++++++++++- 2 files changed, 220 insertions(+), 1 deletion(-) diff --git a/arch/Kconfig b/arch/Kconfig index 6dfc5673215d..8cc3dc87f253 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -489,6 +489,31 @@ config SECCOMP_FILTER See Documentation/userspace-api/seccomp_filter.rst for details. +choice + prompt "Seccomp filter cache" + default SECCOMP_CACHE_NONE + depends on SECCOMP_FILTER + help + Seccomp filters can potentially incur large overhead for each + system call. This can alleviate some of the overhead. + + If in doubt, select 'syscall numbers only'. + +config SECCOMP_CACHE_NONE + bool "None" + help + No caching is done. Seccomp filters will be called each time + a system call occurs in a seccomp-guarded task. + +config SECCOMP_CACHE_NR_ONLY + bool "Syscall number only" + depends on !HAVE_SPARSE_SYSCALL_NR + help + For each syscall number, if the seccomp filter has a fixed + result, store that result in a bitmap to speed up system calls. + +endchoice + config HAVE_ARCH_STACKLEAK bool help diff --git a/kernel/seccomp.c b/kernel/seccomp.c index 3ee59ce0a323..7c286d66f983 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -143,6 +143,32 @@ struct notification { struct list_head notifications; }; +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY +/** + * struct seccomp_cache_filter_data - container for cache's per-filter data + * + * @syscall_ok: A bitmap for each architecture number, where each bit + * represents whether the filter will always allow the syscall. 
+ */ +struct seccomp_cache_filter_data { + DECLARE_BITMAP(syscall_ok[ARRAY_SIZE(syscall_arches)], NR_syscalls); +}; + +#define SECCOMP_EMU_MAX_PENDING_STATES 64 +#else +struct seccomp_cache_filter_data { }; + +static inline int seccomp_cache_prepare(struct seccomp_filter *sfilter) +{ + return 0; +} + +static inline void seccomp_cache_inherit(struct seccomp_filter *sfilter, + const struct seccomp_filter *prev) +{ +} +#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */ + /** * struct seccomp_filter - container for seccomp BPF programs * @@ -185,6 +211,7 @@ struct seccomp_filter { struct notification *notif; struct mutex notify_lock; wait_queue_head_t wqh; + struct seccomp_cache_filter_data cache; }; /* Limit any path through the tree to 256KB worth of instructions. */ @@ -530,6 +557,139 @@ static inline void seccomp_sync_threads(unsigned long flags) } } +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY +/** + * struct seccomp_emu_env - container for seccomp emulator environment + * + * @filter: The cBPF filter instructions. + * @nr: The syscall number we are emulating. + * @arch: The architecture number we are emulating. + * @syscall_ok: Emulation result, whether it is okay for seccomp to cache the + * syscall. + */ +struct seccomp_emu_env { + struct sock_filter *filter; + int arch; + int nr; + bool syscall_ok; +}; + +/** + * struct seccomp_emu_state - container for seccomp emulator state + * + * @next: The next pending state. This structure is a linked list. + * @pc: The current program counter. + * @areg: The value of the A register. + */ +struct seccomp_emu_state { + struct seccomp_emu_state *next; + int pc; + u32 areg; +}; + +/** + * seccomp_emu_step - step one instruction in the emulator + * @env: The emulator environment + * @state: The emulator state + * + * Returns 1 to halt emulation, 0 to continue, or -errno if error occurred. 
+ */ +static int seccomp_emu_step(struct seccomp_emu_env *env, + struct seccomp_emu_state *state) +{ + struct sock_filter *ftest = &env->filter[state->pc++]; + u16 code = ftest->code; + u32 k = ftest->k; + bool compare; + + switch (code) { + case BPF_LD | BPF_W | BPF_ABS: + if (k == offsetof(struct seccomp_data, nr)) + state->areg = env->nr; + else if (k == offsetof(struct seccomp_data, arch)) + state->areg = env->arch; + else + return 1; + + return 0; + case BPF_JMP | BPF_JA: + state->pc += k; + return 0; + case BPF_JMP | BPF_JEQ | BPF_K: + case BPF_JMP | BPF_JGE | BPF_K: + case BPF_JMP | BPF_JGT | BPF_K: + case BPF_JMP | BPF_JSET | BPF_K: + switch (BPF_OP(code)) { + case BPF_JEQ: + compare = state->areg == k; + break; + case BPF_JGT: + compare = state->areg > k; + break; + case BPF_JGE: + compare = state->areg >= k; + break; + case BPF_JSET: + compare = state->areg & k; + break; + default: + WARN_ON(true); + return -EINVAL; + } + + state->pc += compare ? ftest->jt : ftest->jf; + return 0; + case BPF_ALU | BPF_AND | BPF_K: + state->areg &= k; + return 0; + case BPF_RET | BPF_K: + env->syscall_ok = k == SECCOMP_RET_ALLOW; + return 1; + default: + return 1; + } +} + +/** + * seccomp_cache_prepare - emulate the filter to find cachable syscalls + * @sfilter: The seccomp filter + * + * Returns 0 if successful or -errno if error occurred. 
+ */ +int seccomp_cache_prepare(struct seccomp_filter *sfilter) +{ + struct sock_fprog_kern *fprog = sfilter->prog->orig_prog; + struct sock_filter *filter = fprog->filter; + int arch, nr, res = 0; + + for (arch = 0; arch < ARRAY_SIZE(syscall_arches); arch++) { + for (nr = 0; nr < NR_syscalls; nr++) { + struct seccomp_emu_env env = {0}; + struct seccomp_emu_state state = {0}; + + env.filter = filter; + env.arch = syscall_arches[arch]; + env.nr = nr; + + while (true) { + res = seccomp_emu_step(&env, &state); + if (res) + break; + } + + if (res < 0) + goto out; + + if (env.syscall_ok) + set_bit(nr, sfilter->cache.syscall_ok[arch]); + } + } + +out: + return res; +} +#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */ + /** * seccomp_prepare_filter: Prepares a seccomp filter for use. * @fprog: BPF program to install @@ -540,7 +700,8 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog) { struct seccomp_filter *sfilter; int ret; - const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE); + const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE) || + IS_ENABLED(CONFIG_SECCOMP_CACHE_NR_ONLY); if (fprog->len == 0 || fprog->len > BPF_MAXINSNS) return ERR_PTR(-EINVAL); @@ -571,6 +732,13 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog) return ERR_PTR(ret); } + ret = seccomp_cache_prepare(sfilter); + if (ret < 0) { + bpf_prog_destroy(sfilter->prog); + kfree(sfilter); + return ERR_PTR(ret); + } + refcount_set(&sfilter->refs, 1); refcount_set(&sfilter->users, 1); init_waitqueue_head(&sfilter->wqh); @@ -606,6 +774,31 @@ seccomp_prepare_user_filter(const char __user *user_filter) return filter; } +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY +/** + * seccomp_cache_inherit - inherit cacheable syscalls from the previous filter + * @sfilter: The new seccomp filter + * @prev: The previous filter the new one is stacked upon + * + * ANDs each per-arch bitmap with @prev's, so a syscall is only cached as + * allowed if every filter in the chain allows it. 
+ */ +static void seccomp_cache_inherit(struct seccomp_filter *sfilter, + const struct seccomp_filter *prev) +{ + int arch; + + if (!prev) + return; + + for (arch = 0; arch < ARRAY_SIZE(syscall_arches); arch++) { + bitmap_and(sfilter->cache.syscall_ok[arch], + sfilter->cache.syscall_ok[arch], + prev->cache.syscall_ok[arch], + NR_syscalls); + } +} +#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */ + /** * seccomp_attach_filter: validate and attach filter * @flags: flags to change filter behavior @@ -655,6 +848,7 @@ static long seccomp_attach_filter(unsigned int flags, * task reference. */ filter->prev = current->seccomp.filter; + seccomp_cache_inherit(filter, filter->prev); current->seccomp.filter = filter; atomic_inc(¤t->seccomp.filter_count); -- 2.28.0 ^ permalink raw reply related [flat|nested] 149+ messages in thread
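The walk the emulator performs can be illustrated with a self-contained userspace sketch. The opcode names, struct layout, and single-path loop below are simplified stand-ins for the kernel's sock_filter / seccomp_emu_env machinery, not the patch itself:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for cBPF opcodes; the kernel code switches on the
 * real BPF_LD/BPF_JMP/BPF_RET encodings from struct sock_filter. */
enum demo_code {
	OP_LD_NR,	/* A = seccomp_data.nr */
	OP_LD_ARG,	/* A = a syscall argument: not cacheable */
	OP_JEQ_K,	/* pc += (A == k) ? jt : jf */
	OP_RET_ALLOW,
	OP_RET_DENY,
};

struct demo_insn {
	enum demo_code code;
	uint32_t k;
	uint8_t jt, jf;
};

/* Returns true iff the filter provably returns ALLOW for syscall @nr
 * without reading any syscall argument; bails conservatively otherwise,
 * just as seccomp_emu_step() halts on anything it cannot model. */
static bool demo_always_allows(const struct demo_insn *filter, uint32_t nr)
{
	uint32_t areg = 0;
	int pc = 0;

	for (;;) {
		const struct demo_insn *i = &filter[pc++];

		switch (i->code) {
		case OP_LD_NR:
			areg = nr;
			break;
		case OP_LD_ARG:
			return false;	/* arg-dependent: cannot cache */
		case OP_JEQ_K:
			pc += (areg == i->k) ? i->jt : i->jf;
			break;
		case OP_RET_ALLOW:
			return true;	/* would set this nr's bit */
		case OP_RET_DENY:
			return false;
		}
	}
}
```

A filter of the shape { LD_NR; JEQ 0 ? +0 : +1; RET_ALLOW; RET_DENY } yields true only for nr 0, while any filter that loads a syscall argument is never marked cacheable.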
* [PATCH seccomp 4/6] seccomp/cache: Lookup syscall allowlist for fast path 2020-09-24 12:06 ` [PATCH seccomp 0/6] " YiFei Zhu ` (2 preceding siblings ...) 2020-09-24 12:06 ` [PATCH seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent YiFei Zhu @ 2020-09-24 12:06 ` YiFei Zhu 2020-09-24 12:06 ` [PATCH seccomp 5/6] selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu ` (2 subsequent siblings) 6 siblings, 0 replies; 149+ messages in thread From: YiFei Zhu @ 2020-09-24 12:06 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: YiFei Zhu <yifeifz2@illinois.edu> The fast (common) path for seccomp should be that the filter permits the syscall to pass through, and failing seccomp is expected to be an exceptional case; it is not expected for userspace to call a denylisted syscall over and over. This first finds the current allow bitmap by iterating through the syscall_arches[] array and comparing each entry to the arch in struct seccomp_data; this loop is expected to be unrolled. It then does a test_bit against the bitmap. If the bit is set, there is no need to run the full filter; it returns SECCOMP_RET_ALLOW immediately. 
Co-developed-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu> Signed-off-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu> Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> --- kernel/seccomp.c | 37 +++++++++++++++++++++++++++++++++++++ 1 file changed, 37 insertions(+) diff --git a/kernel/seccomp.c b/kernel/seccomp.c index 7c286d66f983..5b1bd8329e9c 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -167,6 +167,12 @@ static inline void seccomp_cache_inherit(struct seccomp_filter *sfilter, const struct seccomp_filter *prev) { } + +static inline bool seccomp_cache_check(const struct seccomp_filter *sfilter, + const struct seccomp_data *sd) +{ + return false; +} #endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */ /** @@ -321,6 +327,34 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen) return 0; } +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY +/** + * seccomp_cache_check - lookup seccomp cache + * @sfilter: The seccomp filter + * @sd: The seccomp data to lookup the cache with + * + * Returns true if the seccomp_data is cached and allowed. 
+ */ +static bool seccomp_cache_check(const struct seccomp_filter *sfilter, + const struct seccomp_data *sd) +{ + int syscall_nr = sd->nr; + int arch; + + if (unlikely(syscall_nr < 0 || syscall_nr >= NR_syscalls)) + return false; + + for (arch = 0; arch < ARRAY_SIZE(syscall_arches); arch++) { + if (likely(syscall_arches[arch] == sd->arch)) + return test_bit(syscall_nr, + sfilter->cache.syscall_ok[arch]); + } + + WARN_ON_ONCE(true); + return false; +} +#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */ + /** * seccomp_run_filters - evaluates all seccomp filters against @sd * @sd: optional seccomp data to be passed to filters @@ -343,6 +377,9 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd, if (WARN_ON(f == NULL)) return SECCOMP_RET_KILL_PROCESS; + if (seccomp_cache_check(f, sd)) + return SECCOMP_RET_ALLOW; + /* * All filters in the list are evaluated and the lowest BPF return * value always takes priority (ignoring the DATA). -- 2.28.0 ^ permalink raw reply related [flat|nested] 149+ messages in thread
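The lookup logic reads as follows in a self-contained userspace sketch, for a hypothetical build with two audit arches and 64 syscalls. The demo_* names are illustrative, though the two arch values happen to match AUDIT_ARCH_X86_64 and AUDIT_ARCH_I386:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_NR_SYSCALLS 64	/* stand-in for NR_syscalls */

/* Hypothetical two-arch build, e.g. x86_64 with IA32 emulation. */
static const uint32_t demo_arches[] = { 0xc000003eu, 0x40000003u };

/* One allow-bitmap per arch, as in struct seccomp_cache_filter_data.
 * Any miss (out-of-range nr, unknown arch, clear bit) means "run the
 * full filter", so the cache can only ever agree with the filter. */
static bool demo_cache_check(const uint64_t syscall_ok[2],
			     uint32_t arch, int nr)
{
	size_t i;

	if (nr < 0 || nr >= DEMO_NR_SYSCALLS)
		return false;		/* out of range: run the full filter */

	/* Small constant loop over demo_arches[]; a compiler can unroll it. */
	for (i = 0; i < sizeof(demo_arches) / sizeof(demo_arches[0]); i++)
		if (demo_arches[i] == arch)
			return (syscall_ok[i] >> nr) & 1;

	return false;			/* unknown arch: run the full filter */
}
```

The caller treats a true result exactly like the patch does: return SECCOMP_RET_ALLOW without evaluating any BPF instruction.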
* [PATCH seccomp 5/6] selftests/seccomp: Compare bitmap vs filter overhead 2020-09-24 12:06 ` [PATCH seccomp 0/6] " YiFei Zhu ` (3 preceding siblings ...) 2020-09-24 12:06 ` [PATCH seccomp 4/6] seccomp/cache: Lookup syscall allowlist for fast path YiFei Zhu @ 2020-09-24 12:06 ` YiFei Zhu 2020-09-24 12:06 ` [PATCH seccomp 6/6] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu 2020-09-24 12:44 ` [PATCH v2 seccomp 0/6] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls YiFei Zhu 6 siblings, 0 replies; 149+ messages in thread From: YiFei Zhu @ 2020-09-24 12:06 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: Kees Cook <keescook@chromium.org> As part of the seccomp benchmarking, include the expectations with regard to the timing behavior of the constant action bitmaps, and report inconsistencies better. Example output with constant action bitmaps on x86: $ sudo ./seccomp_benchmark 100000000 Current BPF sysctl settings: net.core.bpf_jit_enable = 1 net.core.bpf_jit_harden = 0 Benchmarking 100000000 syscalls... 
63.896255358 - 0.008504529 = 63887750829 (63.9s) getpid native: 638 ns 130.383312423 - 63.897315189 = 66485997234 (66.5s) getpid RET_ALLOW 1 filter (bitmap): 664 ns 196.789080421 - 130.384414983 = 66404665438 (66.4s) getpid RET_ALLOW 2 filters (bitmap): 664 ns 268.844643304 - 196.790234168 = 72054409136 (72.1s) getpid RET_ALLOW 3 filters (full): 720 ns 342.627472515 - 268.845799103 = 73781673412 (73.8s) getpid RET_ALLOW 4 filters (full): 737 ns Estimated total seccomp overhead for 1 bitmapped filter: 26 ns Estimated total seccomp overhead for 2 bitmapped filters: 26 ns Estimated total seccomp overhead for 3 full filters: 82 ns Estimated total seccomp overhead for 4 full filters: 99 ns Estimated seccomp entry overhead: 26 ns Estimated seccomp per-filter overhead (last 2 diff): 17 ns Estimated seccomp per-filter overhead (filters / 4): 18 ns Expectations: native ≤ 1 bitmap (638 ≤ 664): ✔️ native ≤ 1 filter (638 ≤ 720): ✔️ per-filter (last 2 diff) ≈ per-filter (filters / 4) (17 ≈ 18): ✔️ 1 bitmapped ≈ 2 bitmapped (26 ≈ 26): ✔️ entry ≈ 1 bitmapped (26 ≈ 26): ✔️ entry ≈ 2 bitmapped (26 ≈ 26): ✔️ native + entry + (per filter * 4) ≈ 4 filters total (732 ≈ 737): ✔️ Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> --- .../selftests/seccomp/seccomp_benchmark.c | 151 +++++++++++++++--- tools/testing/selftests/seccomp/settings | 2 +- 2 files changed, 130 insertions(+), 23 deletions(-) diff --git a/tools/testing/selftests/seccomp/seccomp_benchmark.c b/tools/testing/selftests/seccomp/seccomp_benchmark.c index 91f5a89cadac..fcc806585266 100644 --- a/tools/testing/selftests/seccomp/seccomp_benchmark.c +++ b/tools/testing/selftests/seccomp/seccomp_benchmark.c @@ -4,12 +4,16 @@ */ #define _GNU_SOURCE #include <assert.h> +#include <limits.h> +#include <stdbool.h> +#include <stddef.h> #include <stdio.h> #include <stdlib.h> #include <time.h> #include <unistd.h> #include <linux/filter.h> #include <linux/seccomp.h> +#include <sys/param.h> 
#include <sys/prctl.h> #include <sys/syscall.h> #include <sys/types.h> @@ -70,18 +74,74 @@ unsigned long long calibrate(void) return samples * seconds; } +bool approx(int i_one, int i_two) +{ + double one = i_one, one_bump = one * 0.01; + double two = i_two, two_bump = two * 0.01; + + one_bump = one + MAX(one_bump, 2.0); + two_bump = two + MAX(two_bump, 2.0); + + /* Equal to, or within 1% or 2 digits */ + if (one == two || + (one > two && one <= two_bump) || + (two > one && two <= one_bump)) + return true; + return false; +} + +bool le(int i_one, int i_two) +{ + if (i_one <= i_two) + return true; + return false; +} + +long compare(const char *name_one, const char *name_eval, const char *name_two, + unsigned long long one, bool (*eval)(int, int), unsigned long long two) +{ + bool good; + + printf("\t%s %s %s (%lld %s %lld): ", name_one, name_eval, name_two, + (long long)one, name_eval, (long long)two); + if (one > INT_MAX) { + printf("Miscalculation! Measurement went negative: %lld\n", (long long)one); + return 1; + } + if (two > INT_MAX) { + printf("Miscalculation! Measurement went negative: %lld\n", (long long)two); + return 1; + } + + good = eval(one, two); + printf("%s\n", good ? "✔️" : "❌"); + + return good ? 
0 : 1; +} + int main(int argc, char *argv[]) { + struct sock_filter bitmap_filter[] = { + BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, nr)), + BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW), + }; + struct sock_fprog bitmap_prog = { + .len = (unsigned short)ARRAY_SIZE(bitmap_filter), + .filter = bitmap_filter, + }; struct sock_filter filter[] = { + BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, args[0])), BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW), }; struct sock_fprog prog = { .len = (unsigned short)ARRAY_SIZE(filter), .filter = filter, }; - long ret; - unsigned long long samples; - unsigned long long native, filter1, filter2; + + long ret, bits; + unsigned long long samples, calc; + unsigned long long native, filter1, filter2, bitmap1, bitmap2; + unsigned long long entry, per_filter1, per_filter2; printf("Current BPF sysctl settings:\n"); system("sysctl net.core.bpf_jit_enable"); @@ -101,35 +161,82 @@ int main(int argc, char *argv[]) ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0); assert(ret == 0); - /* One filter */ - ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog); + /* One filter resulting in a bitmap */ + ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bitmap_prog); assert(ret == 0); - filter1 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples; - printf("getpid RET_ALLOW 1 filter: %llu ns\n", filter1); + bitmap1 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples; + printf("getpid RET_ALLOW 1 filter (bitmap): %llu ns\n", bitmap1); + + /* Second filter resulting in a bitmap */ + ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bitmap_prog); + assert(ret == 0); - if (filter1 == native) - printf("No overhead measured!? 
Try running again with more samples.\n"); + bitmap2 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples; + printf("getpid RET_ALLOW 2 filters (bitmap): %llu ns\n", bitmap2); - /* Two filters */ + /* Third filter, can no longer be converted to bitmap */ ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog); assert(ret == 0); - filter2 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples; - printf("getpid RET_ALLOW 2 filters: %llu ns\n", filter2); - - /* Calculations */ - printf("Estimated total seccomp overhead for 1 filter: %llu ns\n", - filter1 - native); + filter1 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples; + printf("getpid RET_ALLOW 3 filters (full): %llu ns\n", filter1); - printf("Estimated total seccomp overhead for 2 filters: %llu ns\n", - filter2 - native); + /* Fourth filter, can not be converted to bitmap because of filter 3 */ + ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bitmap_prog); + assert(ret == 0); - printf("Estimated seccomp per-filter overhead: %llu ns\n", - filter2 - filter1); + filter2 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples; + printf("getpid RET_ALLOW 4 filters (full): %llu ns\n", filter2); + + /* Estimations */ +#define ESTIMATE(fmt, var, what) do { \ + var = (what); \ + printf("Estimated " fmt ": %llu ns\n", var); \ + if (var > INT_MAX) \ + goto more_samples; \ + } while (0) + + ESTIMATE("total seccomp overhead for 1 bitmapped filter", calc, + bitmap1 - native); + ESTIMATE("total seccomp overhead for 2 bitmapped filters", calc, + bitmap2 - native); + ESTIMATE("total seccomp overhead for 3 full filters", calc, + filter1 - native); + ESTIMATE("total seccomp overhead for 4 full filters", calc, + filter2 - native); + ESTIMATE("seccomp entry overhead", entry, + bitmap1 - native - (bitmap2 - bitmap1)); + ESTIMATE("seccomp per-filter overhead (last 2 diff)", per_filter1, + filter2 - filter1); + ESTIMATE("seccomp per-filter overhead (filters / 4)", per_filter2, + (filter2 - native - entry) / 4); + + 
printf("Expectations:\n"); + ret |= compare("native", "≤", "1 bitmap", native, le, bitmap1); + bits = compare("native", "≤", "1 filter", native, le, filter1); + if (bits) + goto more_samples; + + ret |= compare("per-filter (last 2 diff)", "≈", "per-filter (filters / 4)", + per_filter1, approx, per_filter2); + + bits = compare("1 bitmapped", "≈", "2 bitmapped", + bitmap1 - native, approx, bitmap2 - native); + if (bits) { + printf("Skipping constant action bitmap expectations: they appear unsupported.\n"); + goto out; + } - printf("Estimated seccomp entry overhead: %llu ns\n", - filter1 - native - (filter2 - filter1)); + ret |= compare("entry", "≈", "1 bitmapped", entry, approx, bitmap1 - native); + ret |= compare("entry", "≈", "2 bitmapped", entry, approx, bitmap2 - native); + ret |= compare("native + entry + (per filter * 4)", "≈", "4 filters total", + entry + (per_filter1 * 4) + native, approx, filter2); + if (ret == 0) + goto out; +more_samples: + printf("Saw unexpected benchmark result. Try running again with more samples?\n"); +out: return 0; } diff --git a/tools/testing/selftests/seccomp/settings b/tools/testing/selftests/seccomp/settings index ba4d85f74cd6..6091b45d226b 100644 --- a/tools/testing/selftests/seccomp/settings +++ b/tools/testing/selftests/seccomp/settings @@ -1 +1 @@ -timeout=90 +timeout=120 -- 2.28.0 ^ permalink raw reply related [flat|nested] 149+ messages in thread
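The overhead estimates printed by the benchmark follow directly from the measured per-syscall times. As a sanity check, this standalone sketch (hypothetical helper name, but the same arithmetic as the selftest's ESTIMATE() lines) reproduces the numbers reported in the example output:

```c
/* Reproduce the benchmark's overhead arithmetic (all values in ns):
 * entry cost is what one bitmapped filter adds beyond the marginal cost
 * of a second one; per-filter cost is estimated two ways and compared. */
struct estimates {
	long long entry;
	long long per_filter_diff;	/* "last 2 diff" */
	long long per_filter_avg;	/* "filters / 4" */
};

static struct estimates estimate(long long native,
				 long long bitmap1, long long bitmap2,
				 long long filter1, long long filter2)
{
	struct estimates e;

	e.entry = bitmap1 - native - (bitmap2 - bitmap1);
	e.per_filter_diff = filter2 - filter1;
	e.per_filter_avg = (filter2 - native - e.entry) / 4;
	return e;
}
```

Plugging in the reported 638/664/664/720/737 ns timings yields the 26 ns entry and 17-18 ns per-filter estimates shown in the benchmark output.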
* [PATCH seccomp 6/6] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-09-24 12:06 ` [PATCH seccomp 0/6] " YiFei Zhu ` (4 preceding siblings ...) 2020-09-24 12:06 ` [PATCH seccomp 5/6] selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu @ 2020-09-24 12:06 ` YiFei Zhu 2020-09-24 12:44 ` [PATCH v2 seccomp 0/6] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls YiFei Zhu 6 siblings, 0 replies; 149+ messages in thread From: YiFei Zhu @ 2020-09-24 12:06 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: YiFei Zhu <yifeifz2@illinois.edu> Currently the kernel does not provide infrastructure to translate architecture numbers to human-readable names. Translating syscall numbers to syscall names is possible through the FTRACE_SYSCALL infrastructure, but it does not support compat syscalls. This creates a file for each PID as /proc/pid/seccomp_cache. The file is empty when no seccomp filters are loaded, or is in the format: <hex arch number> <decimal syscall number> <ALLOW | FILTER> where ALLOW means the cache is guaranteed to allow the syscall, and FILTER means the cache will pass the syscall on to the BPF filters. For the docker default profile on x86_64 it looks like: c000003e 0 ALLOW c000003e 1 ALLOW c000003e 2 ALLOW c000003e 3 ALLOW [...] c000003e 132 ALLOW c000003e 133 ALLOW c000003e 134 FILTER c000003e 135 FILTER c000003e 136 FILTER c000003e 137 ALLOW c000003e 138 ALLOW c000003e 139 FILTER c000003e 140 ALLOW c000003e 141 ALLOW [...] This file is guarded by CONFIG_PROC_SECCOMP_CACHE with a default of N, because certain users of seccomp might not want the application to know which syscalls are definitely usable.
I'm not sure if adding all the "human readable names" is worthwhile, considering it can be easily done in userspace. Suggested-by: Jann Horn <jannh@google.com> Link: https://lore.kernel.org/lkml/CAG48ez3Ofqp4crXGksLmZY6=fGrF_tWyUCg7PBkAetvbbOPeOA@mail.gmail.com/ Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> --- arch/Kconfig | 10 ++++++++++ fs/proc/base.c | 7 +++++-- include/linux/seccomp.h | 5 +++++ kernel/seccomp.c | 26 ++++++++++++++++++++++++++ 4 files changed, 46 insertions(+), 2 deletions(-) diff --git a/arch/Kconfig b/arch/Kconfig index 8cc3dc87f253..dbfd897e5dc0 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -514,6 +514,16 @@ config SECCOMP_CACHE_NR_ONLY endchoice +config PROC_SECCOMP_CACHE + bool "Show seccomp filter cache status in /proc/pid/seccomp_cache" + depends on SECCOMP_CACHE_NR_ONLY + depends on PROC_FS + help + This is enables /proc/pid/seccomp_cache interface to monitor + seccomp cache data. The file format is subject to change. + + If unsure, say N. + config HAVE_ARCH_STACKLEAK bool help diff --git a/fs/proc/base.c b/fs/proc/base.c index 617db4e0faa0..2af626f69fa1 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -2615,7 +2615,7 @@ static struct dentry *proc_pident_instantiate(struct dentry *dentry, return d_splice_alias(inode, dentry); } -static struct dentry *proc_pident_lookup(struct inode *dir, +static struct dentry *proc_pident_lookup(struct inode *dir, struct dentry *dentry, const struct pid_entry *p, const struct pid_entry *end) @@ -2815,7 +2815,7 @@ static const struct pid_entry attr_dir_stuff[] = { static int proc_attr_dir_readdir(struct file *file, struct dir_context *ctx) { - return proc_pident_readdir(file, ctx, + return proc_pident_readdir(file, ctx, attr_dir_stuff, ARRAY_SIZE(attr_dir_stuff)); } @@ -3258,6 +3258,9 @@ static const struct pid_entry tgid_base_stuff[] = { #ifdef CONFIG_PROC_PID_ARCH_STATUS ONE("arch_status", S_IRUGO, proc_pid_arch_status), #endif +#ifdef CONFIG_PROC_SECCOMP_CACHE + ONE("seccomp_cache", S_IRUSR, 
proc_pid_seccomp_cache), +#endif }; static int proc_tgid_base_readdir(struct file *file, struct dir_context *ctx) diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h index 02aef2844c38..3cedec824365 100644 --- a/include/linux/seccomp.h +++ b/include/linux/seccomp.h @@ -121,4 +121,9 @@ static inline long seccomp_get_metadata(struct task_struct *task, return -EINVAL; } #endif /* CONFIG_SECCOMP_FILTER && CONFIG_CHECKPOINT_RESTORE */ + +#ifdef CONFIG_PROC_SECCOMP_CACHE +int proc_pid_seccomp_cache(struct seq_file *m, struct pid_namespace *ns, + struct pid *pid, struct task_struct *task); +#endif #endif /* _LINUX_SECCOMP_H */ diff --git a/kernel/seccomp.c b/kernel/seccomp.c index 5b1bd8329e9c..c5697d9483ae 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -2295,3 +2295,29 @@ static int __init seccomp_sysctl_init(void) device_initcall(seccomp_sysctl_init) #endif /* CONFIG_SYSCTL */ + +#ifdef CONFIG_PROC_SECCOMP_CACHE +/* Currently CONFIG_PROC_SECCOMP_CACHE implies CONFIG_SECCOMP_CACHE_NR_ONLY */ +int proc_pid_seccomp_cache(struct seq_file *m, struct pid_namespace *ns, + struct pid *pid, struct task_struct *task) +{ + struct seccomp_filter *f = READ_ONCE(task->seccomp.filter); + int arch, nr; + + if (!f) + return 0; + + for (arch = 0; arch < ARRAY_SIZE(syscall_arches); arch++) { + for (nr = 0; nr < NR_syscalls; nr++) { + bool cached = test_bit(nr, f->cache.syscall_ok[arch]); + char *status = cached ? "ALLOW" : "FILTER"; + + seq_printf(m, "%08x %d %s\n", syscall_arches[arch], + nr, status + ); + } + } + + return 0; +} +#endif /* CONFIG_PROC_SECCOMP_CACHE */ -- 2.28.0 ^ permalink raw reply related [flat|nested] 149+ messages in thread
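A consumer of the proposed /proc/<pid>/seccomp_cache file could parse it line by line. This is a hedged sketch: the commit message notes the format is subject to change, and parse_cache_line() is a hypothetical userspace helper, not part of the patch.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Parse one line of the proposed /proc/<pid>/seccomp_cache format:
 *   "<hex arch> <decimal nr> <ALLOW|FILTER>"
 * Returns true on success; *allow is set when the verdict is ALLOW.
 * The format is explicitly subject to change, so treat this as a sketch. */
static bool parse_cache_line(const char *line, unsigned int *arch, int *nr,
			     bool *allow)
{
	char verdict[16];

	if (sscanf(line, "%x %d %15s", arch, nr, verdict) != 3)
		return false;
	if (strcmp(verdict, "ALLOW") == 0) {
		*allow = true;
		return true;
	}
	if (strcmp(verdict, "FILTER") == 0) {
		*allow = false;
		return true;
	}
	return false;
}
```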
* [PATCH v2 seccomp 0/6] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls 2020-09-24 12:06 ` [PATCH seccomp 0/6] " YiFei Zhu ` (5 preceding siblings ...) 2020-09-24 12:06 ` [PATCH seccomp 6/6] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu @ 2020-09-24 12:44 ` YiFei Zhu 2020-09-24 12:44 ` [PATCH v2 seccomp 1/6] seccomp: Move config option SECCOMP to arch/Kconfig YiFei Zhu ` (5 more replies) 6 siblings, 6 replies; 149+ messages in thread From: YiFei Zhu @ 2020-09-24 12:44 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: YiFei Zhu <yifeifz2@illinois.edu> Alternative: https://lore.kernel.org/lkml/20200923232923.3142503-1-keescook@chromium.org/T/ Major differences from the linked alternative by Kees: * No x32 special-case handling -- not worth the complexity * No caching of denylist -- not worth the complexity * No seccomp arch pinning -- I think this is an independent feature * The bitmaps are part of the filters rather than the task. * Architectures supported by default through arch number array, except for MIPS with its sparse syscall numbers. * Configurable per-build for future different cache modes. This series adds a bitmap to cache seccomp filter results if the result permits a syscall and is independent of syscall arguments. This visibly decreases seccomp overhead for most common seccomp filters with very little memory footprint. The overhead of running Seccomp filters has been part of some past discussions [1][2][3]. Oftentimes, the filters have a large number of instructions that check syscall numbers one by one and jump based on that. Some users chain BPF filters, which further enlarges the overhead.
A recent work [6] comprehensively measures the Seccomp overhead and shows that the overhead is non-negligible and has a non-trivial impact on application performance. We observed that some common filters, such as docker's [4] or systemd's [5], make most decisions based only on the syscall number, and as past discussions considered, a bitmap where each bit represents a syscall makes the most sense for these filters. In order to build this bitmap at filter attach time, each filter is emulated for every syscall (under each possible architecture), and checked for any accesses of struct seccomp_data other than the "arch" and "nr" (syscall) members. If only "arch" and "nr" are examined, and the program returns allow, then we can be sure that the filter must return allow independently of the syscall arguments. When it is concluded that an allow must occur for the given architecture and syscall pair, seccomp will immediately allow the syscall, bypassing further BPF execution. Ongoing work is to further support arguments with fast hash table lookups. We are investigating the performance of doing so [6], and how to best integrate with the existing seccomp infrastructure. Benchmarks were performed, with the results included in patch 5 and copied below: Current BPF sysctl settings: net.core.bpf_jit_enable = 1 net.core.bpf_jit_harden = 0 Benchmarking 100000000 syscalls...
63.896255358 - 0.008504529 = 63887750829 (63.9s) getpid native: 638 ns 130.383312423 - 63.897315189 = 66485997234 (66.5s) getpid RET_ALLOW 1 filter (bitmap): 664 ns 196.789080421 - 130.384414983 = 66404665438 (66.4s) getpid RET_ALLOW 2 filters (bitmap): 664 ns 268.844643304 - 196.790234168 = 72054409136 (72.1s) getpid RET_ALLOW 3 filters (full): 720 ns 342.627472515 - 268.845799103 = 73781673412 (73.8s) getpid RET_ALLOW 4 filters (full): 737 ns Estimated total seccomp overhead for 1 bitmapped filter: 26 ns Estimated total seccomp overhead for 2 bitmapped filters: 26 ns Estimated total seccomp overhead for 3 full filters: 82 ns Estimated total seccomp overhead for 4 full filters: 99 ns Estimated seccomp entry overhead: 26 ns Estimated seccomp per-filter overhead (last 2 diff): 17 ns Estimated seccomp per-filter overhead (filters / 4): 18 ns Expectations: native ≤ 1 bitmap (638 ≤ 664): ✔️ native ≤ 1 filter (638 ≤ 720): ✔️ per-filter (last 2 diff) ≈ per-filter (filters / 4) (17 ≈ 18): ✔️ 1 bitmapped ≈ 2 bitmapped (26 ≈ 26): ✔️ entry ≈ 1 bitmapped (26 ≈ 26): ✔️ entry ≈ 2 bitmapped (26 ≈ 26): ✔️ native + entry + (per filter * 4) ≈ 4 filters total (732 ≈ 737): ✔️ RFC -> v1: * Config made on by default across all arches that could support it. * Added arch numbers array and emulate filter for each arch number, and have a per-arch bitmap. * Massively simplified the emulator so it would only support the common instructions in Kees's list. * Fixed inheriting bitmap across filters (filter->prev is always NULL during prepare). * Stole the selftest from Kees. * Added a /proc/pid/seccomp_cache by Jann's suggestion. v1 -> v2: * Corrected one outdated function documentation. Patch 1 moves the SECCOMP Kcomfig option to arch/Kconfig. Patch 2 adds a syscall_arches array so the emulator can enumerate it. Patch 3 implements the emulator that finds if a filter must return allow, Patch 4 implements the test_bit against the bitmaps. 
Patch 5 updates the selftest to better show the new semantics. Patch 6 implements /proc/pid/seccomp_cache. [1] https://lore.kernel.org/linux-security-module/c22a6c3cefc2412cad00ae14c1371711@huawei.com/T/ [2] https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/T/ [3] https://github.com/seccomp/libseccomp/issues/116 [4] https://github.com/moby/moby/blob/ae0ef82b90356ac613f329a8ef5ee42ca923417d/profiles/seccomp/default.json [5] https://github.com/systemd/systemd/blob/6743a1caf4037f03dc51a1277855018e4ab61957/src/shared/seccomp-util.c#L270 [6] Draco: Architectural and Operating System Support for System Call Security https://tianyin.github.io/pub/draco.pdf, MICRO-53, Oct. 2020 Kees Cook (1): selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu (5): seccomp: Move config option SECCOMP to arch/Kconfig asm/syscall.h: Add syscall_arches[] array seccomp/cache: Add "emulator" to check if filter is arg-dependent seccomp/cache: Lookup syscall allowlist for fast path seccomp/cache: Report cache data through /proc/pid/seccomp_cache arch/Kconfig | 56 ++++ arch/alpha/include/asm/syscall.h | 4 + arch/arc/include/asm/syscall.h | 24 +- arch/arm/Kconfig | 15 +- arch/arm/include/asm/syscall.h | 4 + arch/arm64/Kconfig | 13 - arch/arm64/include/asm/syscall.h | 4 + arch/c6x/include/asm/syscall.h | 13 +- arch/csky/Kconfig | 13 - arch/csky/include/asm/syscall.h | 4 + arch/h8300/include/asm/syscall.h | 4 + arch/hexagon/include/asm/syscall.h | 4 + arch/ia64/include/asm/syscall.h | 4 + arch/m68k/include/asm/syscall.h | 4 + arch/microblaze/Kconfig | 18 +- arch/microblaze/include/asm/syscall.h | 4 + arch/mips/Kconfig | 17 -- arch/mips/include/asm/syscall.h | 16 ++ arch/nds32/include/asm/syscall.h | 13 +- arch/nios2/include/asm/syscall.h | 4 + arch/openrisc/include/asm/syscall.h | 4 + arch/parisc/Kconfig | 16 -- arch/parisc/include/asm/syscall.h | 7 + arch/powerpc/Kconfig | 17 -- arch/powerpc/include/asm/syscall.h | 14 + arch/riscv/Kconfig | 13 - arch/riscv/include/asm/syscall.h 
| 14 +- arch/s390/Kconfig | 17 -- arch/s390/include/asm/syscall.h | 7 + arch/sh/Kconfig | 16 -- arch/sh/include/asm/syscall_32.h | 17 +- arch/sparc/Kconfig | 18 +- arch/sparc/include/asm/syscall.h | 9 + arch/um/Kconfig | 16 -- arch/x86/Kconfig | 16 -- arch/x86/include/asm/syscall.h | 11 + arch/x86/um/asm/syscall.h | 14 +- arch/xtensa/Kconfig | 14 - arch/xtensa/include/asm/syscall.h | 4 + fs/proc/base.c | 7 +- include/linux/seccomp.h | 5 + kernel/seccomp.c | 257 +++++++++++++++++- .../selftests/seccomp/seccomp_benchmark.c | 151 ++++++++-- tools/testing/selftests/seccomp/settings | 2 +- 44 files changed, 639 insertions(+), 265 deletions(-) -- 2.28.0 ^ permalink raw reply [flat|nested] 149+ messages in thread
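The attach-time analysis described in the cover letter — emulate the filter for each (arch, nr) pair and cache an allow only when the verdict provably ignores the arguments — can be modeled in miniature. This toy replaces the real cBPF emulator with a C callback that self-reports whether it consulted the arguments; the names and the example policy are illustrative only, not the patch's actual emulator:

```c
#include <stdbool.h>

typedef bool (*toy_filter)(int nr, const long *args, bool *read_args);

/* Example policy: allows everything below 100 by number alone, but for
 * nr == 100 it inspects args[0], so that syscall is not cacheable. */
static bool toy_docker_like(int nr, const long *args, bool *read_args)
{
	if (nr == 100) {
		*read_args = true;
		return args[0] == 0;
	}
	return nr < 100;
}

/* Attach-time pass: record a verdict in the bitmap only when it is both
 * ALLOW and independent of the arguments; everything else stays on the
 * slow path so the cache can never loosen the policy. */
static void toy_build_cache(toy_filter f, unsigned char *allow_map,
			    int nr_syscalls)
{
	long zero_args[6] = {0};
	int nr;

	for (nr = 0; nr < nr_syscalls; nr++) {
		bool read_args = false;
		bool allowed = f(nr, zero_args, &read_args);

		if (allowed && !read_args)
			allow_map[nr / 8] |= 1u << (nr % 8);
	}
}
```

The kernel's emulator achieves the equivalent of the read_args flag by tracking loads from struct seccomp_data while walking the BPF program's control flow.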
* [PATCH v2 seccomp 1/6] seccomp: Move config option SECCOMP to arch/Kconfig 2020-09-24 12:44 ` [PATCH v2 seccomp 0/6] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls YiFei Zhu @ 2020-09-24 12:44 ` YiFei Zhu 2020-09-24 19:11 ` Kees Cook 2020-10-27 9:52 ` Geert Uytterhoeven 2020-09-24 12:44 ` [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array YiFei Zhu ` (4 subsequent siblings) 5 siblings, 2 replies; 149+ messages in thread From: YiFei Zhu @ 2020-09-24 12:44 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: YiFei Zhu <yifeifz2@illinois.edu> In order to make adding configurable features into seccomp easier, it's better to have the options in a single location, especially considering that the bulk of the seccomp code is arch-independent. A quick look also shows that many SECCOMP descriptions are outdated; they talk about /proc rather than prctl. As a result of moving the config option and keeping it default-on, architectures arm, arm64, csky, riscv, sh, and xtensa, which did not have SECCOMP on by default prior to this, now have it on by default. Architectures microblaze, mips, powerpc, s390, sh, and sparc carry an outdated dependency on PROC_FS; this dependency is removed in this change.
Suggested-by: Jann Horn <jannh@google.com> Link: https://lore.kernel.org/lkml/CAG48ez1YWz9cnp08UZgeieYRhHdqh-ch7aNwc4JRBnGyrmgfMg@mail.gmail.com/ Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> --- arch/Kconfig | 21 +++++++++++++++++++++ arch/arm/Kconfig | 15 +-------------- arch/arm64/Kconfig | 13 ------------- arch/csky/Kconfig | 13 ------------- arch/microblaze/Kconfig | 18 +----------------- arch/mips/Kconfig | 17 ----------------- arch/parisc/Kconfig | 16 ---------------- arch/powerpc/Kconfig | 17 ----------------- arch/riscv/Kconfig | 13 ------------- arch/s390/Kconfig | 17 ----------------- arch/sh/Kconfig | 16 ---------------- arch/sparc/Kconfig | 18 +----------------- arch/um/Kconfig | 16 ---------------- arch/x86/Kconfig | 16 ---------------- arch/xtensa/Kconfig | 14 -------------- 15 files changed, 24 insertions(+), 216 deletions(-) diff --git a/arch/Kconfig b/arch/Kconfig index af14a567b493..6dfc5673215d 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -444,8 +444,12 @@ config ARCH_WANT_OLD_COMPAT_IPC select ARCH_WANT_COMPAT_IPC_PARSE_VERSION bool +config HAVE_ARCH_SECCOMP + bool + config HAVE_ARCH_SECCOMP_FILTER bool + select HAVE_ARCH_SECCOMP help An arch should select this symbol if it provides all of these things: - syscall_get_arch() @@ -458,6 +462,23 @@ config HAVE_ARCH_SECCOMP_FILTER results in the system call being skipped immediately. - seccomp syscall wired up +config SECCOMP + def_bool y + depends on HAVE_ARCH_SECCOMP + prompt "Enable seccomp to safely compute untrusted bytecode" + help + This kernel feature is useful for number crunching applications + that may need to compute untrusted bytecode during their + execution. By using pipes or other transports made available to + the process as file descriptors supporting the read/write + syscalls, it's possible to isolate those applications in + their own address space using seccomp. 
Once seccomp is + enabled via prctl(PR_SET_SECCOMP), it cannot be disabled + and the task is only allowed to execute a few safe syscalls + defined by each seccomp mode. + + If unsure, say Y. Only embedded should say N here. + config SECCOMP_FILTER def_bool y depends on HAVE_ARCH_SECCOMP_FILTER && SECCOMP && NET diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index e00d94b16658..e26c19a16284 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -67,6 +67,7 @@ config ARM select HAVE_ARCH_JUMP_LABEL if !XIP_KERNEL && !CPU_ENDIAN_BE32 && MMU select HAVE_ARCH_KGDB if !CPU_ENDIAN_BE32 && MMU select HAVE_ARCH_MMAP_RND_BITS if MMU + select HAVE_ARCH_SECCOMP select HAVE_ARCH_SECCOMP_FILTER if AEABI && !OABI_COMPAT select HAVE_ARCH_THREAD_STRUCT_WHITELIST select HAVE_ARCH_TRACEHOOK @@ -1617,20 +1618,6 @@ config UACCESS_WITH_MEMCPY However, if the CPU data cache is using a write-allocate mode, this option is unlikely to provide any performance gain. -config SECCOMP - bool - prompt "Enable seccomp to safely compute untrusted bytecode" - help - This kernel feature is useful for number crunching applications - that may need to compute untrusted bytecode during their - execution. By using pipes or other transports made available to - the process as file descriptors supporting the read/write - syscalls, it's possible to isolate those applications in - their own address space using seccomp. Once seccomp is - enabled via prctl(PR_SET_SECCOMP), it cannot be disabled - and the task is only allowed to execute a few safe syscalls - defined by each seccomp mode. 
- config PARAVIRT bool "Enable paravirtualization code" help diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 6d232837cbee..98c4e34cbec1 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -1033,19 +1033,6 @@ config ARCH_ENABLE_SPLIT_PMD_PTLOCK config CC_HAVE_SHADOW_CALL_STACK def_bool $(cc-option, -fsanitize=shadow-call-stack -ffixed-x18) -config SECCOMP - bool "Enable seccomp to safely compute untrusted bytecode" - help - This kernel feature is useful for number crunching applications - that may need to compute untrusted bytecode during their - execution. By using pipes or other transports made available to - the process as file descriptors supporting the read/write - syscalls, it's possible to isolate those applications in - their own address space using seccomp. Once seccomp is - enabled via prctl(PR_SET_SECCOMP), it cannot be disabled - and the task is only allowed to execute a few safe syscalls - defined by each seccomp mode. - config PARAVIRT bool "Enable paravirtualization code" help diff --git a/arch/csky/Kconfig b/arch/csky/Kconfig index 3d5afb5f5685..7f424c85772c 100644 --- a/arch/csky/Kconfig +++ b/arch/csky/Kconfig @@ -309,16 +309,3 @@ endmenu source "arch/csky/Kconfig.platforms" source "kernel/Kconfig.hz" - -config SECCOMP - bool "Enable seccomp to safely compute untrusted bytecode" - help - This kernel feature is useful for number crunching applications - that may need to compute untrusted bytecode during their - execution. By using pipes or other transports made available to - the process as file descriptors supporting the read/write - syscalls, it's possible to isolate those applications in - their own address space using seccomp. Once seccomp is - enabled via prctl(PR_SET_SECCOMP), it cannot be disabled - and the task is only allowed to execute a few safe syscalls - defined by each seccomp mode. 
diff --git a/arch/microblaze/Kconfig b/arch/microblaze/Kconfig index d262ac0c8714..37bd6a5f38fb 100644 --- a/arch/microblaze/Kconfig +++ b/arch/microblaze/Kconfig @@ -26,6 +26,7 @@ config MICROBLAZE select GENERIC_SCHED_CLOCK select HAVE_ARCH_HASH select HAVE_ARCH_KGDB + select HAVE_ARCH_SECCOMP select HAVE_DEBUG_KMEMLEAK select HAVE_DMA_CONTIGUOUS select HAVE_DYNAMIC_FTRACE @@ -120,23 +121,6 @@ config CMDLINE_FORCE Set this to have arguments from the default kernel command string override those passed by the boot loader. -config SECCOMP - bool "Enable seccomp to safely compute untrusted bytecode" - depends on PROC_FS - default y - help - This kernel feature is useful for number crunching applications - that may need to compute untrusted bytecode during their - execution. By using pipes or other transports made available to - the process as file descriptors supporting the read/write - syscalls, it's possible to isolate those applications in - their own address space using seccomp. Once seccomp is - enabled via /proc/<pid>/seccomp, it cannot be disabled - and the task is only allowed to execute a few safe syscalls - defined by each seccomp mode. - - If unsure, say Y. Only embedded should say N here. - endmenu menu "Kernel features" diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig index c95fa3a2484c..5f88a8fc11fc 100644 --- a/arch/mips/Kconfig +++ b/arch/mips/Kconfig @@ -3004,23 +3004,6 @@ config PHYSICAL_START specified in the "crashkernel=YM@XM" command line boot parameter passed to the panic-ed kernel). -config SECCOMP - bool "Enable seccomp to safely compute untrusted bytecode" - depends on PROC_FS - default y - help - This kernel feature is useful for number crunching applications - that may need to compute untrusted bytecode during their - execution. By using pipes or other transports made available to - the process as file descriptors supporting the read/write - syscalls, it's possible to isolate those applications in - their own address space using seccomp. 
Once seccomp is - enabled via /proc/<pid>/seccomp, it cannot be disabled - and the task is only allowed to execute a few safe syscalls - defined by each seccomp mode. - - If unsure, say Y. Only embedded should say N here. - config MIPS_O32_FP64_SUPPORT bool "Support for O32 binaries using 64-bit FP" if !CPU_MIPSR6 depends on 32BIT || MIPS32_O32 diff --git a/arch/parisc/Kconfig b/arch/parisc/Kconfig index 3b0f53dd70bc..cd4afe1e7a6c 100644 --- a/arch/parisc/Kconfig +++ b/arch/parisc/Kconfig @@ -378,19 +378,3 @@ endmenu source "drivers/parisc/Kconfig" - -config SECCOMP - def_bool y - prompt "Enable seccomp to safely compute untrusted bytecode" - help - This kernel feature is useful for number crunching applications - that may need to compute untrusted bytecode during their - execution. By using pipes or other transports made available to - the process as file descriptors supporting the read/write - syscalls, it's possible to isolate those applications in - their own address space using seccomp. Once seccomp is - enabled via prctl(PR_SET_SECCOMP), it cannot be disabled - and the task is only allowed to execute a few safe syscalls - defined by each seccomp mode. - - If unsure, say Y. Only embedded should say N here. diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 1f48bbfb3ce9..136fe860caef 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -934,23 +934,6 @@ config ARCH_WANTS_FREEZER_CONTROL source "kernel/power/Kconfig" -config SECCOMP - bool "Enable seccomp to safely compute untrusted bytecode" - depends on PROC_FS - default y - help - This kernel feature is useful for number crunching applications - that may need to compute untrusted bytecode during their - execution. By using pipes or other transports made available to - the process as file descriptors supporting the read/write - syscalls, it's possible to isolate those applications in - their own address space using seccomp. 
Once seccomp is - enabled via /proc/<pid>/seccomp, it cannot be disabled - and the task is only allowed to execute a few safe syscalls - defined by each seccomp mode. - - If unsure, say Y. Only embedded should say N here. - config PPC_MEM_KEYS prompt "PowerPC Memory Protection Keys" def_bool y diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig index df18372861d8..c456b558fab9 100644 --- a/arch/riscv/Kconfig +++ b/arch/riscv/Kconfig @@ -333,19 +333,6 @@ menu "Kernel features" source "kernel/Kconfig.hz" -config SECCOMP - bool "Enable seccomp to safely compute untrusted bytecode" - help - This kernel feature is useful for number crunching applications - that may need to compute untrusted bytecode during their - execution. By using pipes or other transports made available to - the process as file descriptors supporting the read/write - syscalls, it's possible to isolate those applications in - their own address space using seccomp. Once seccomp is - enabled via prctl(PR_SET_SECCOMP), it cannot be disabled - and the task is only allowed to execute a few safe syscalls - defined by each seccomp mode. - config RISCV_SBI_V01 bool "SBI v0.1 support" default y diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig index 3d86e12e8e3c..7f7b40ec699e 100644 --- a/arch/s390/Kconfig +++ b/arch/s390/Kconfig @@ -791,23 +791,6 @@ config CRASH_DUMP endmenu -config SECCOMP - def_bool y - prompt "Enable seccomp to safely compute untrusted bytecode" - depends on PROC_FS - help - This kernel feature is useful for number crunching applications - that may need to compute untrusted bytecode during their - execution. By using pipes or other transports made available to - the process as file descriptors supporting the read/write - syscalls, it's possible to isolate those applications in - their own address space using seccomp. Once seccomp is - enabled via /proc/<pid>/seccomp, it cannot be disabled - and the task is only allowed to execute a few safe syscalls - defined by each seccomp mode. 
- - If unsure, say Y. - config CCW def_bool y diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig index d20927128fce..18278152c91c 100644 --- a/arch/sh/Kconfig +++ b/arch/sh/Kconfig @@ -600,22 +600,6 @@ config PHYSICAL_START where the fail safe kernel needs to run at a different address than the panic-ed kernel. -config SECCOMP - bool "Enable seccomp to safely compute untrusted bytecode" - depends on PROC_FS - help - This kernel feature is useful for number crunching applications - that may need to compute untrusted bytecode during their - execution. By using pipes or other transports made available to - the process as file descriptors supporting the read/write - syscalls, it's possible to isolate those applications in - their own address space using seccomp. Once seccomp is - enabled via prctl, it cannot be disabled and the task is only - allowed to execute a few safe syscalls defined by each seccomp - mode. - - If unsure, say N. - config SMP bool "Symmetric multi-processing support" depends on SYS_SUPPORTS_SMP diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig index efeff2c896a5..d62ce83cf009 100644 --- a/arch/sparc/Kconfig +++ b/arch/sparc/Kconfig @@ -23,6 +23,7 @@ config SPARC select HAVE_OPROFILE select HAVE_ARCH_KGDB if !SMP || SPARC64 select HAVE_ARCH_TRACEHOOK + select HAVE_ARCH_SECCOMP if SPARC64 select HAVE_EXIT_THREAD select HAVE_PCI select SYSCTL_EXCEPTION_TRACE @@ -226,23 +227,6 @@ config EARLYFB help Say Y here to enable a faster early framebuffer boot console. -config SECCOMP - bool "Enable seccomp to safely compute untrusted bytecode" - depends on SPARC64 && PROC_FS - default y - help - This kernel feature is useful for number crunching applications - that may need to compute untrusted bytecode during their - execution. By using pipes or other transports made available to - the process as file descriptors supporting the read/write - syscalls, it's possible to isolate those applications in - their own address space using seccomp. 
Once seccomp is - enabled via /proc/<pid>/seccomp, it cannot be disabled - and the task is only allowed to execute a few safe syscalls - defined by each seccomp mode. - - If unsure, say Y. Only embedded should say N here. - config HOTPLUG_CPU bool "Support for hot-pluggable CPUs" depends on SPARC64 && SMP diff --git a/arch/um/Kconfig b/arch/um/Kconfig index eb51fec75948..d49f471b02e3 100644 --- a/arch/um/Kconfig +++ b/arch/um/Kconfig @@ -173,22 +173,6 @@ config PGTABLE_LEVELS default 3 if 3_LEVEL_PGTABLES default 2 -config SECCOMP - def_bool y - prompt "Enable seccomp to safely compute untrusted bytecode" - help - This kernel feature is useful for number crunching applications - that may need to compute untrusted bytecode during their - execution. By using pipes or other transports made available to - the process as file descriptors supporting the read/write - syscalls, it's possible to isolate those applications in - their own address space using seccomp. Once seccomp is - enabled via prctl(PR_SET_SECCOMP), it cannot be disabled - and the task is only allowed to execute a few safe syscalls - defined by each seccomp mode. - - If unsure, say Y. - config UML_TIME_TRAVEL_SUPPORT bool prompt "Support time-travel mode (e.g. for test execution)" diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 7101ac64bb20..1ab22869a765 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -1968,22 +1968,6 @@ config EFI_MIXED If unsure, say N. -config SECCOMP - def_bool y - prompt "Enable seccomp to safely compute untrusted bytecode" - help - This kernel feature is useful for number crunching applications - that may need to compute untrusted bytecode during their - execution. By using pipes or other transports made available to - the process as file descriptors supporting the read/write - syscalls, it's possible to isolate those applications in - their own address space using seccomp. 
Once seccomp is - enabled via prctl(PR_SET_SECCOMP), it cannot be disabled - and the task is only allowed to execute a few safe syscalls - defined by each seccomp mode. - - If unsure, say Y. Only embedded should say N here. - source "kernel/Kconfig.hz" config KEXEC diff --git a/arch/xtensa/Kconfig b/arch/xtensa/Kconfig index e997e0119c02..d8a29dc5a284 100644 --- a/arch/xtensa/Kconfig +++ b/arch/xtensa/Kconfig @@ -217,20 +217,6 @@ config HOTPLUG_CPU Say N if you want to disable CPU hotplug. -config SECCOMP - bool - prompt "Enable seccomp to safely compute untrusted bytecode" - help - This kernel feature is useful for number crunching applications - that may need to compute untrusted bytecode during their - execution. By using pipes or other transports made available to - the process as file descriptors supporting the read/write - syscalls, it's possible to isolate those applications in - their own address space using seccomp. Once seccomp is - enabled via prctl(PR_SET_SECCOMP), it cannot be disabled - and the task is only allowed to execute a few safe syscalls - defined by each seccomp mode. - config FAST_SYSCALL_XTENSA bool "Enable fast atomic syscalls" default n -- 2.28.0 ^ permalink raw reply related [flat|nested] 149+ messages in thread
* Re: [PATCH v2 seccomp 1/6] seccomp: Move config option SECCOMP to arch/Kconfig 2020-09-24 12:44 ` [PATCH v2 seccomp 1/6] seccomp: Move config option SECCOMP to arch/Kconfig YiFei Zhu @ 2020-09-24 19:11 ` Kees Cook 2020-10-27 9:52 ` Geert Uytterhoeven 1 sibling, 0 replies; 149+ messages in thread From: Kees Cook @ 2020-09-24 19:11 UTC (permalink / raw) To: containers, YiFei Zhu Cc: Kees Cook, Tycho Andersen, Valentin Rothberg, Aleksa Sarai, Giuseppe Scrivano, Jann Horn, Tobin Feldman-Fitzthum, Josep Torrellas, Tianyin Xu, Hubertus Franke, linux-kernel, bpf, YiFei Zhu, Dimitrios Skarlatos, Jack Chen, Andrea Arcangeli, Andy Lutomirski, Will Drewry On Thu, 24 Sep 2020 07:44:15 -0500, YiFei Zhu wrote: > In order to make adding configurable features into seccomp > easier, it's better to have the options at one single location, > considering easpecially that the bulk of seccomp code is > arch-independent. An quick look also show that many SECCOMP > descriptions are outdated; they talk about /proc rather than > prctl. > > As a result of moving the config option and keeping it default > on, architectures arm, arm64, csky, riscv, sh, and xtensa > did not have SECCOMP on by default prior to this and SECCOMP will > be default in this change. > > Architectures microblaze, mips, powerpc, s390, sh, and sparc > have an outdated depend on PROC_FS and this dependency is removed > in this change. > > Suggested-by: Jann Horn <jannh@google.com> > Link: https://lore.kernel.org/lkml/CAG48ez1YWz9cnp08UZgeieYRhHdqh-ch7aNwc4JRBnGyrmgfMg@mail.gmail.com/ > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> > [...] Yes; I've been meaning to do this for a while now. Thank you! I tweaked the help text a bit. Applied, thanks! [1/1] seccomp: Move config option SECCOMP to arch/Kconfig https://git.kernel.org/kees/c/c3c9c2df3636 -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v2 seccomp 1/6] seccomp: Move config option SECCOMP to arch/Kconfig 2020-09-24 12:44 ` [PATCH v2 seccomp 1/6] seccomp: Move config option SECCOMP to arch/Kconfig YiFei Zhu 2020-09-24 19:11 ` Kees Cook @ 2020-10-27 9:52 ` Geert Uytterhoeven 2020-10-27 19:08 ` YiFei Zhu 2020-10-28 0:06 ` Kees Cook 1 sibling, 2 replies; 149+ messages in thread From: Geert Uytterhoeven @ 2020-10-27 9:52 UTC (permalink / raw) To: YiFei Zhu Cc: containers, YiFei Zhu, bpf, Linux Kernel Mailing List, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry Hi Yifei, On Thu, Sep 24, 2020 at 2:48 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > From: YiFei Zhu <yifeifz2@illinois.edu> > > In order to make adding configurable features into seccomp > easier, it's better to have the options at one single location, > considering easpecially that the bulk of seccomp code is > arch-independent. An quick look also show that many SECCOMP > descriptions are outdated; they talk about /proc rather than > prctl. > > As a result of moving the config option and keeping it default > on, architectures arm, arm64, csky, riscv, sh, and xtensa > did not have SECCOMP on by default prior to this and SECCOMP will > be default in this change. > > Architectures microblaze, mips, powerpc, s390, sh, and sparc > have an outdated depend on PROC_FS and this dependency is removed > in this change. > > Suggested-by: Jann Horn <jannh@google.com> > Link: https://lore.kernel.org/lkml/CAG48ez1YWz9cnp08UZgeieYRhHdqh-ch7aNwc4JRBnGyrmgfMg@mail.gmail.com/ > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> Thanks for your patch. which is now commit 282a181b1a0d66de ("seccomp: Move config option SECCOMP to arch/Kconfig") in v5.10-rc1. 
> --- a/arch/Kconfig > +++ b/arch/Kconfig > @@ -458,6 +462,23 @@ config HAVE_ARCH_SECCOMP_FILTER > results in the system call being skipped immediately. > - seccomp syscall wired up > > +config SECCOMP > + def_bool y > + depends on HAVE_ARCH_SECCOMP > + prompt "Enable seccomp to safely compute untrusted bytecode" > + help > + This kernel feature is useful for number crunching applications > + that may need to compute untrusted bytecode during their > + execution. By using pipes or other transports made available to > + the process as file descriptors supporting the read/write > + syscalls, it's possible to isolate those applications in > + their own address space using seccomp. Once seccomp is > + enabled via prctl(PR_SET_SECCOMP), it cannot be disabled > + and the task is only allowed to execute a few safe syscalls > + defined by each seccomp mode. > + > + If unsure, say Y. Only embedded should say N here. > + Please tell me why SECCOMP is special, and deserves to default to be enabled. Is it really that critical, given only 13.5 (half of sparc ;-) out of 24 architectures implement support for it? Thanks! Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v2 seccomp 1/6] seccomp: Move config option SECCOMP to arch/Kconfig 2020-10-27 9:52 ` Geert Uytterhoeven @ 2020-10-27 19:08 ` YiFei Zhu 2020-10-28 0:06 ` Kees Cook 1 sibling, 0 replies; 149+ messages in thread From: YiFei Zhu @ 2020-10-27 19:08 UTC (permalink / raw) To: Geert Uytterhoeven Cc: Linux Containers, YiFei Zhu, bpf, Linux Kernel Mailing List, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Tue, Oct 27, 2020 at 4:52 AM Geert Uytterhoeven <geert@linux-m68k.org> wrote: > Please tell me why SECCOMP is special, and deserves to default to be > enabled. Is it really that critical, given only 13.5 (half of sparc > ;-) out of 24 > architectures implement support for it? Good point. My thought process is that quite a lot of system software relies on seccomp for enforcing policies -- systemd, docker, and other sandboxing tools like browsers and firejail, so when I moved this to the non-per-arch section, it at least has to be default for x86. Granted, I'm not super familiar with other architectures, so you are probably right that those that did not have it on by default should be kept off by default; many of them could be for embedded devices. What's the best way to do this? Set it as default N in Kconfig and add CONFIG_SECCOMP=y in each arch's defconfig? YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
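The defconfig approach YiFei asks about would look roughly like this (a hypothetical sketch, not code from any posted patch):

```kconfig
# arch/Kconfig: no global default
config SECCOMP
	bool "Enable seccomp to safely compute untrusted bytecode"
	depends on HAVE_ARCH_SECCOMP

# arch/<arch>/configs/defconfig: each architecture opts in explicitly
CONFIG_SECCOMP=y
```

The cost is touching every architecture's defconfig, and existing configs refreshed with `make olddefconfig` would silently drop SECCOMP on architectures that had carried their own `default y`.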
* Re: [PATCH v2 seccomp 1/6] seccomp: Move config option SECCOMP to arch/Kconfig 2020-10-27 9:52 ` Geert Uytterhoeven 2020-10-27 19:08 ` YiFei Zhu @ 2020-10-28 0:06 ` Kees Cook 2020-10-28 8:18 ` Geert Uytterhoeven 1 sibling, 1 reply; 149+ messages in thread From: Kees Cook @ 2020-10-28 0:06 UTC (permalink / raw) To: Geert Uytterhoeven Cc: YiFei Zhu, containers, YiFei Zhu, bpf, Linux Kernel Mailing List, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Tue, Oct 27, 2020 at 10:52:39AM +0100, Geert Uytterhoeven wrote: > Hi Yifei, > > On Thu, Sep 24, 2020 at 2:48 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > > From: YiFei Zhu <yifeifz2@illinois.edu> > > > > In order to make adding configurable features into seccomp > > easier, it's better to have the options at one single location, > > considering easpecially that the bulk of seccomp code is > > arch-independent. An quick look also show that many SECCOMP > > descriptions are outdated; they talk about /proc rather than > > prctl. > > > > As a result of moving the config option and keeping it default > > on, architectures arm, arm64, csky, riscv, sh, and xtensa > > did not have SECCOMP on by default prior to this and SECCOMP will > > be default in this change. > > > > Architectures microblaze, mips, powerpc, s390, sh, and sparc > > have an outdated depend on PROC_FS and this dependency is removed > > in this change. > > > > Suggested-by: Jann Horn <jannh@google.com> > > Link: https://lore.kernel.org/lkml/CAG48ez1YWz9cnp08UZgeieYRhHdqh-ch7aNwc4JRBnGyrmgfMg@mail.gmail.com/ > > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> > > Thanks for your patch. which is now commit 282a181b1a0d66de ("seccomp: > Move config option SECCOMP to arch/Kconfig") in v5.10-rc1. 
> > > --- a/arch/Kconfig > > +++ b/arch/Kconfig > > @@ -458,6 +462,23 @@ config HAVE_ARCH_SECCOMP_FILTER > > results in the system call being skipped immediately. > > - seccomp syscall wired up > > > > +config SECCOMP > > + def_bool y > > + depends on HAVE_ARCH_SECCOMP > > + prompt "Enable seccomp to safely compute untrusted bytecode" > > + help > > + This kernel feature is useful for number crunching applications > > + that may need to compute untrusted bytecode during their > > + execution. By using pipes or other transports made available to > > + the process as file descriptors supporting the read/write > > + syscalls, it's possible to isolate those applications in > > + their own address space using seccomp. Once seccomp is > > + enabled via prctl(PR_SET_SECCOMP), it cannot be disabled > > + and the task is only allowed to execute a few safe syscalls > > + defined by each seccomp mode. > > + > > + If unsure, say Y. Only embedded should say N here. > > + > > Please tell me why SECCOMP is special, and deserves to default to be > enabled. Is it really that critical, given only 13.5 (half of sparc > ;-) out of 24 > architectures implement support for it? That's an excellent point; I missed this in my review as I saw several Kconfig already marked "def_bool y" but failed to note it wasn't _all_ of them. Okay, checking before this patch, these had them effectively enabled: via Kconfig: parisc s390 um x86 via defconfig, roughly speaking: arm arm64 sh How about making the default depend on HAVE_ARCH_SECCOMP_FILTER? 
These have SECCOMP_FILTER support: arch/arm/Kconfig: select HAVE_ARCH_SECCOMP_FILTER if AEABI && !OABI_COMPAT arch/arm64/Kconfig: select HAVE_ARCH_SECCOMP_FILTER arch/csky/Kconfig: select HAVE_ARCH_SECCOMP_FILTER arch/mips/Kconfig: select HAVE_ARCH_SECCOMP_FILTER arch/parisc/Kconfig: select HAVE_ARCH_SECCOMP_FILTER arch/powerpc/Kconfig: select HAVE_ARCH_SECCOMP_FILTER arch/riscv/Kconfig: select HAVE_ARCH_SECCOMP_FILTER arch/s390/Kconfig: select HAVE_ARCH_SECCOMP_FILTER arch/sh/Kconfig: select HAVE_ARCH_SECCOMP_FILTER arch/um/Kconfig: select HAVE_ARCH_SECCOMP_FILTER arch/x86/Kconfig: select HAVE_ARCH_SECCOMP_FILTER arch/xtensa/Kconfig: select HAVE_ARCH_SECCOMP_FILTER So the "new" promotions would be: csky mips powerpc riscv xtensa Which would leave only these two: arch/microblaze/Kconfig: select HAVE_ARCH_SECCOMP arch/sparc/Kconfig: select HAVE_ARCH_SECCOMP if SPARC64 At this point, given the ubiquity of seccomp usage (e.g. systemd), I guess it's not unreasonable to make it def_bool y? I'm open to suggestions! -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v2 seccomp 1/6] seccomp: Move config option SECCOMP to arch/Kconfig 2020-10-28 0:06 ` Kees Cook @ 2020-10-28 8:18 ` Geert Uytterhoeven 2020-10-28 9:34 ` Jann Horn 0 siblings, 1 reply; 149+ messages in thread From: Geert Uytterhoeven @ 2020-10-28 8:18 UTC (permalink / raw) To: Kees Cook Cc: YiFei Zhu, containers, YiFei Zhu, bpf, Linux Kernel Mailing List, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry Hi Kees, On Wed, Oct 28, 2020 at 1:06 AM Kees Cook <keescook@chromium.org> wrote: > On Tue, Oct 27, 2020 at 10:52:39AM +0100, Geert Uytterhoeven wrote: > > On Thu, Sep 24, 2020 at 2:48 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > > > From: YiFei Zhu <yifeifz2@illinois.edu> > > > > > > In order to make adding configurable features into seccomp > > > easier, it's better to have the options at one single location, > > > considering easpecially that the bulk of seccomp code is > > > arch-independent. An quick look also show that many SECCOMP > > > descriptions are outdated; they talk about /proc rather than > > > prctl. > > > > > > As a result of moving the config option and keeping it default > > > on, architectures arm, arm64, csky, riscv, sh, and xtensa > > > did not have SECCOMP on by default prior to this and SECCOMP will > > > be default in this change. > > > > > > Architectures microblaze, mips, powerpc, s390, sh, and sparc > > > have an outdated depend on PROC_FS and this dependency is removed > > > in this change. > > > > > > Suggested-by: Jann Horn <jannh@google.com> > > > Link: https://lore.kernel.org/lkml/CAG48ez1YWz9cnp08UZgeieYRhHdqh-ch7aNwc4JRBnGyrmgfMg@mail.gmail.com/ > > > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> > > > > Thanks for your patch. which is now commit 282a181b1a0d66de ("seccomp: > > Move config option SECCOMP to arch/Kconfig") in v5.10-rc1. 
> > > > > --- a/arch/Kconfig > > > +++ b/arch/Kconfig > > > @@ -458,6 +462,23 @@ config HAVE_ARCH_SECCOMP_FILTER > > > results in the system call being skipped immediately. > > > - seccomp syscall wired up > > > > > > +config SECCOMP > > > + def_bool y > > > + depends on HAVE_ARCH_SECCOMP > > > + prompt "Enable seccomp to safely compute untrusted bytecode" > > > + help > > > + This kernel feature is useful for number crunching applications > > > + that may need to compute untrusted bytecode during their > > > + execution. By using pipes or other transports made available to > > > + the process as file descriptors supporting the read/write > > > + syscalls, it's possible to isolate those applications in > > > + their own address space using seccomp. Once seccomp is > > > + enabled via prctl(PR_SET_SECCOMP), it cannot be disabled > > > + and the task is only allowed to execute a few safe syscalls > > > + defined by each seccomp mode. > > > + > > > + If unsure, say Y. Only embedded should say N here. > > > + > > > > Please tell me why SECCOMP is special, and deserves to default to be > > enabled. Is it really that critical, given only 13.5 (half of sparc > > ;-) out of 24 > > architectures implement support for it? > > That's an excellent point; I missed this in my review as I saw several > Kconfig already marked "def_bool y" but failed to note it wasn't _all_ > of them. Okay, checking before this patch, these had them effectively > enabled: > > via Kconfig: > > parisc > s390 > um > x86 Mostly "server" and "desktop" platforms. > via defconfig, roughly speaking: > > arm > arm64 > sh Note that these defconfigs are example configs, not meant for production. E.g. arm/multi_v7_defconfig and arm64/defconfig enable about everything for compile coverage. > How about making the default depend on HAVE_ARCH_SECCOMP_FILTER? 
> > These have SECCOMP_FILTER support: > > arch/arm/Kconfig: select HAVE_ARCH_SECCOMP_FILTER if AEABI && !OABI_COMPAT > arch/arm64/Kconfig: select HAVE_ARCH_SECCOMP_FILTER > arch/csky/Kconfig: select HAVE_ARCH_SECCOMP_FILTER > arch/mips/Kconfig: select HAVE_ARCH_SECCOMP_FILTER > arch/parisc/Kconfig: select HAVE_ARCH_SECCOMP_FILTER > arch/powerpc/Kconfig: select HAVE_ARCH_SECCOMP_FILTER > arch/riscv/Kconfig: select HAVE_ARCH_SECCOMP_FILTER > arch/s390/Kconfig: select HAVE_ARCH_SECCOMP_FILTER > arch/sh/Kconfig: select HAVE_ARCH_SECCOMP_FILTER > arch/um/Kconfig: select HAVE_ARCH_SECCOMP_FILTER > arch/x86/Kconfig: select HAVE_ARCH_SECCOMP_FILTER > arch/xtensa/Kconfig: select HAVE_ARCH_SECCOMP_FILTER > > So the "new" promotions would be: > > csky > mips > powerpc > riscv > xtensa > > Which would leave only these two: > > arch/microblaze/Kconfig: select HAVE_ARCH_SECCOMP > arch/sparc/Kconfig: select HAVE_ARCH_SECCOMP if SPARC64 > > At this point, given the ubiquity of seccomp usage (e.g. systemd), I > guess it's not unreasonable to make it def_bool y? Having support does not necessarily imply you want it enabled. If systemd needs it (does it? I have Debian nfsroots with systemd, without SECCOMP), you can enable it in the defconfig. "Default y" is for things you cannot do without, unless you know better. Bloat-o-meter says enabling SECCOMP consumes only ca. 8 KiB (on arm32), so perhaps "default y if !EXPERT"? Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds ^ permalink raw reply [flat|nested] 149+ messages in thread
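Geert's suggestion, spelled out as a Kconfig fragment (hypothetical sketch; the final in-tree entry may differ):

```kconfig
config SECCOMP
	bool "Enable seccomp to safely compute untrusted bytecode"
	depends on HAVE_ARCH_SECCOMP
	default y if !EXPERT
```

With this, non-EXPERT configurations keep SECCOMP on by default, while EXPERT builds default to N but can still enable it through the prompt.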
* Re: [PATCH v2 seccomp 1/6] seccomp: Move config option SECCOMP to arch/Kconfig 2020-10-28 8:18 ` Geert Uytterhoeven @ 2020-10-28 9:34 ` Jann Horn 0 siblings, 0 replies; 149+ messages in thread From: Jann Horn @ 2020-10-28 9:34 UTC (permalink / raw) To: Geert Uytterhoeven Cc: Kees Cook, YiFei Zhu, Linux Containers, YiFei Zhu, bpf, Linux Kernel Mailing List, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Wed, Oct 28, 2020 at 9:19 AM Geert Uytterhoeven <geert@linux-m68k.org> wrote: > On Wed, Oct 28, 2020 at 1:06 AM Kees Cook <keescook@chromium.org> wrote: > > On Tue, Oct 27, 2020 at 10:52:39AM +0100, Geert Uytterhoeven wrote: > > > On Thu, Sep 24, 2020 at 2:48 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > > > > From: YiFei Zhu <yifeifz2@illinois.edu> > > > > > > > > In order to make adding configurable features into seccomp > > > > easier, it's better to have the options at one single location, > > > > considering easpecially that the bulk of seccomp code is > > > > arch-independent. An quick look also show that many SECCOMP > > > > descriptions are outdated; they talk about /proc rather than > > > > prctl. > > > > > > > > As a result of moving the config option and keeping it default > > > > on, architectures arm, arm64, csky, riscv, sh, and xtensa > > > > did not have SECCOMP on by default prior to this and SECCOMP will > > > > be default in this change. > > > > > > > > Architectures microblaze, mips, powerpc, s390, sh, and sparc > > > > have an outdated depend on PROC_FS and this dependency is removed > > > > in this change. > > > > > > > > Suggested-by: Jann Horn <jannh@google.com> > > > > Link: https://lore.kernel.org/lkml/CAG48ez1YWz9cnp08UZgeieYRhHdqh-ch7aNwc4JRBnGyrmgfMg@mail.gmail.com/ > > > > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> > > > > > > Thanks for your patch. 
which is now commit 282a181b1a0d66de ("seccomp: > > > Move config option SECCOMP to arch/Kconfig") in v5.10-rc1. > > > > > > > --- a/arch/Kconfig > > > > +++ b/arch/Kconfig > > > > @@ -458,6 +462,23 @@ config HAVE_ARCH_SECCOMP_FILTER > > > > results in the system call being skipped immediately. > > > > - seccomp syscall wired up > > > > > > > > +config SECCOMP > > > > + def_bool y > > > > + depends on HAVE_ARCH_SECCOMP > > > > + prompt "Enable seccomp to safely compute untrusted bytecode" > > > > + help > > > > + This kernel feature is useful for number crunching applications > > > > + that may need to compute untrusted bytecode during their > > > > + execution. By using pipes or other transports made available to > > > > + the process as file descriptors supporting the read/write > > > > + syscalls, it's possible to isolate those applications in > > > > + their own address space using seccomp. Once seccomp is > > > > + enabled via prctl(PR_SET_SECCOMP), it cannot be disabled > > > > + and the task is only allowed to execute a few safe syscalls > > > > + defined by each seccomp mode. > > > > + > > > > + If unsure, say Y. Only embedded should say N here. > > > > + > > > > > > Please tell me why SECCOMP is special, and deserves to default to be > > > enabled. Is it really that critical, given only 13.5 (half of sparc > > > ;-) out of 24 > > > architectures implement support for it? > > > > That's an excellent point; I missed this in my review as I saw several > > Kconfig already marked "def_bool y" but failed to note it wasn't _all_ > > of them. Okay, checking before this patch, these had them effectively > > enabled: > > > > via Kconfig: > > > > parisc > > s390 > > um > > x86 > > Mostly "server" and "desktop" platforms. > > > via defconfig, roughly speaking: > > > > arm > > arm64 > > sh > > Note that these defconfigs are example configs, not meant for production. > E.g. arm/multi_v7_defconfig and arm64/defconfig enable about everything > for compile coverage. 
> > > How about making the default depend on HAVE_ARCH_SECCOMP_FILTER? > > > > These have SECCOMP_FILTER support: > > > > arch/arm/Kconfig: select HAVE_ARCH_SECCOMP_FILTER if AEABI && !OABI_COMPAT > > arch/arm64/Kconfig: select HAVE_ARCH_SECCOMP_FILTER > > arch/csky/Kconfig: select HAVE_ARCH_SECCOMP_FILTER > > arch/mips/Kconfig: select HAVE_ARCH_SECCOMP_FILTER > > arch/parisc/Kconfig: select HAVE_ARCH_SECCOMP_FILTER > > arch/powerpc/Kconfig: select HAVE_ARCH_SECCOMP_FILTER > > arch/riscv/Kconfig: select HAVE_ARCH_SECCOMP_FILTER > > arch/s390/Kconfig: select HAVE_ARCH_SECCOMP_FILTER > > arch/sh/Kconfig: select HAVE_ARCH_SECCOMP_FILTER > > arch/um/Kconfig: select HAVE_ARCH_SECCOMP_FILTER > > arch/x86/Kconfig: select HAVE_ARCH_SECCOMP_FILTER > > arch/xtensa/Kconfig: select HAVE_ARCH_SECCOMP_FILTER > > > > So the "new" promotions would be: > > > > csky > > mips > > powerpc > > riscv > > xtensa > > > > Which would leave only these two: > > > > arch/microblaze/Kconfig: select HAVE_ARCH_SECCOMP > > arch/sparc/Kconfig: select HAVE_ARCH_SECCOMP if SPARC64 > > > > At this point, given the ubiquity of seccomp usage (e.g. systemd), I > > guess it's not unreasonable to make it def_bool y? > > Having support does not necessarily imply you want it enabled. > If systemd needs it (does it? I have Debian nfsroots with systemd, > without SECCOMP), you can enable it in the defconfig. > "Default y" is for things you cannot do without, unless you know > better. > > Bloat-o-meter says enabling SECCOMP consumes only ca. 8 KiB > (on arm32), so perhaps "default y if !EXPERT"? Gating a *default* on EXPERT seems weird to me. Isn't EXPERT normally used to gate whether things are configurable at all (using "if EXPERT")? I think that at least on systems with MMU, SECCOMP should default to y, independent of what EXPERT is set to. When SECCOMP is disabled, various pieces of software will have to (potentially invisibly to the user) degrade their belts-and-suspenders security measures. 
For example, as far as I understand, systemd has support for using seccomp to restrict what services can do (and uses that for some of its built-in services), but skips those steps with a log message if you don't have SECCOMP. Perhaps more importantly, the same thing happens in OpenSSH's ssh_sandbox_child() function - it generates a debug message, then continues on. If someone does manage to find an OpenSSH pre-auth remote code execution bug in a few years, I think we very much wouldn't want to be in a situation where that can be used to compromise a bunch of routers just because SECCOMP wasn't in the default config, or because it was invisibly disabled when the router vendor enabled EXPERT so that they can get rid of io_uring support. ^ permalink raw reply [flat|nested] 149+ messages in thread
* [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array 2020-09-24 12:44 ` [PATCH v2 seccomp 0/6] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls YiFei Zhu 2020-09-24 12:44 ` [PATCH v2 seccomp 1/6] seccomp: Move config option SECCOMP to arch/Kconfig YiFei Zhu @ 2020-09-24 12:44 ` YiFei Zhu 2020-09-24 13:47 ` David Laight 2020-09-24 12:44 ` [PATCH v2 seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent YiFei Zhu ` (3 subsequent siblings) 5 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-09-24 12:44 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: YiFei Zhu <yifeifz2@illinois.edu> The seccomp cache emulator needs to know all the architecture numbers that syscall_get_arch() could return for the kernel build, in order to generate a cache for all of them. The array is declared in the header as static __maybe_unused const to maximize compiler optimization opportunities such as loop unrolling.
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> --- arch/alpha/include/asm/syscall.h | 4 ++++ arch/arc/include/asm/syscall.h | 24 +++++++++++++++++++----- arch/arm/include/asm/syscall.h | 4 ++++ arch/arm64/include/asm/syscall.h | 4 ++++ arch/c6x/include/asm/syscall.h | 13 +++++++++++-- arch/csky/include/asm/syscall.h | 4 ++++ arch/h8300/include/asm/syscall.h | 4 ++++ arch/hexagon/include/asm/syscall.h | 4 ++++ arch/ia64/include/asm/syscall.h | 4 ++++ arch/m68k/include/asm/syscall.h | 4 ++++ arch/microblaze/include/asm/syscall.h | 4 ++++ arch/mips/include/asm/syscall.h | 16 ++++++++++++++++ arch/nds32/include/asm/syscall.h | 13 +++++++++++-- arch/nios2/include/asm/syscall.h | 4 ++++ arch/openrisc/include/asm/syscall.h | 4 ++++ arch/parisc/include/asm/syscall.h | 7 +++++++ arch/powerpc/include/asm/syscall.h | 14 ++++++++++++++ arch/riscv/include/asm/syscall.h | 14 ++++++++++---- arch/s390/include/asm/syscall.h | 7 +++++++ arch/sh/include/asm/syscall_32.h | 17 +++++++++++------ arch/sparc/include/asm/syscall.h | 9 +++++++++ arch/x86/include/asm/syscall.h | 11 +++++++++++ arch/x86/um/asm/syscall.h | 14 ++++++++++---- arch/xtensa/include/asm/syscall.h | 4 ++++ 24 files changed, 184 insertions(+), 23 deletions(-) diff --git a/arch/alpha/include/asm/syscall.h b/arch/alpha/include/asm/syscall.h index 11c688c1d7ec..625ac9b23f37 100644 --- a/arch/alpha/include/asm/syscall.h +++ b/arch/alpha/include/asm/syscall.h @@ -4,6 +4,10 @@ #include <uapi/linux/audit.h> +static __maybe_unused const int syscall_arches[] = { + AUDIT_ARCH_ALPHA +}; + static inline int syscall_get_arch(struct task_struct *task) { return AUDIT_ARCH_ALPHA; diff --git a/arch/arc/include/asm/syscall.h b/arch/arc/include/asm/syscall.h index 94529e89dff0..899c13cbf5cc 100644 --- a/arch/arc/include/asm/syscall.h +++ b/arch/arc/include/asm/syscall.h @@ -65,14 +65,28 @@ syscall_get_arguments(struct task_struct *task, struct pt_regs *regs, } } +#ifdef CONFIG_ISA_ARCOMPACT +# ifdef CONFIG_CPU_BIG_ENDIAN +# define 
SYSCALL_ARCH AUDIT_ARCH_ARCOMPACTBE +# else +# define SYSCALL_ARCH AUDIT_ARCH_ARCOMPACT +# endif /* CONFIG_CPU_BIG_ENDIAN */ +#else +# ifdef CONFIG_CPU_BIG_ENDIAN +# define SYSCALL_ARCH AUDIT_ARCH_ARCV2BE +# else +# define SYSCALL_ARCH AUDIT_ARCH_ARCV2 +# endif /* CONFIG_CPU_BIG_ENDIAN */ +#endif /* CONFIG_ISA_ARCOMPACT */ + +static __maybe_unused const int syscall_arches[] = { + SYSCALL_ARCH +}; + static inline int syscall_get_arch(struct task_struct *task) { - return IS_ENABLED(CONFIG_ISA_ARCOMPACT) - ? (IS_ENABLED(CONFIG_CPU_BIG_ENDIAN) - ? AUDIT_ARCH_ARCOMPACTBE : AUDIT_ARCH_ARCOMPACT) - : (IS_ENABLED(CONFIG_CPU_BIG_ENDIAN) - ? AUDIT_ARCH_ARCV2BE : AUDIT_ARCH_ARCV2); + return SYSCALL_ARCH; } #endif diff --git a/arch/arm/include/asm/syscall.h b/arch/arm/include/asm/syscall.h index fd02761ba06c..33ade26e3956 100644 --- a/arch/arm/include/asm/syscall.h +++ b/arch/arm/include/asm/syscall.h @@ -73,6 +73,10 @@ static inline void syscall_set_arguments(struct task_struct *task, memcpy(®s->ARM_r0 + 1, args, 5 * sizeof(args[0])); } +static __maybe_unused const int syscall_arches[] = { + AUDIT_ARCH_ARM +}; + static inline int syscall_get_arch(struct task_struct *task) { /* ARM tasks don't change audit architectures on the fly. */ diff --git a/arch/arm64/include/asm/syscall.h b/arch/arm64/include/asm/syscall.h index cfc0672013f6..77f3d300e7a0 100644 --- a/arch/arm64/include/asm/syscall.h +++ b/arch/arm64/include/asm/syscall.h @@ -82,6 +82,10 @@ static inline void syscall_set_arguments(struct task_struct *task, memcpy(®s->regs[1], args, 5 * sizeof(args[0])); } +static __maybe_unused const int syscall_arches[] = { + AUDIT_ARCH_ARM, AUDIT_ARCH_AARCH64 +}; + /* * We don't care about endianness (__AUDIT_ARCH_LE bit) here because * AArch64 has the same system calls both on little- and big- endian. 
diff --git a/arch/c6x/include/asm/syscall.h b/arch/c6x/include/asm/syscall.h index 38f3e2284ecd..0d78c67ee1fc 100644 --- a/arch/c6x/include/asm/syscall.h +++ b/arch/c6x/include/asm/syscall.h @@ -66,10 +66,19 @@ static inline void syscall_set_arguments(struct task_struct *task, regs->a9 = *args; } +#ifdef CONFIG_CPU_BIG_ENDIAN +#define SYSCALL_ARCH AUDIT_ARCH_C6XBE +#else +#define SYSCALL_ARCH AUDIT_ARCH_C6X +#endif + +static __maybe_unused const int syscall_arches[] = { + SYSCALL_ARCH +}; + static inline int syscall_get_arch(struct task_struct *task) { - return IS_ENABLED(CONFIG_CPU_BIG_ENDIAN) - ? AUDIT_ARCH_C6XBE : AUDIT_ARCH_C6X; + return SYSCALL_ARCH; } #endif /* __ASM_C6X_SYSCALLS_H */ diff --git a/arch/csky/include/asm/syscall.h b/arch/csky/include/asm/syscall.h index f624fa3bbc22..86242d2850d7 100644 --- a/arch/csky/include/asm/syscall.h +++ b/arch/csky/include/asm/syscall.h @@ -68,6 +68,10 @@ syscall_set_arguments(struct task_struct *task, struct pt_regs *regs, memcpy(®s->a1, args, 5 * sizeof(regs->a1)); } +static __maybe_unused const int syscall_arches[] = { + AUDIT_ARCH_CSKY +}; + static inline int syscall_get_arch(struct task_struct *task) { diff --git a/arch/h8300/include/asm/syscall.h b/arch/h8300/include/asm/syscall.h index 01666b8bb263..775f6ac8fde3 100644 --- a/arch/h8300/include/asm/syscall.h +++ b/arch/h8300/include/asm/syscall.h @@ -28,6 +28,10 @@ syscall_get_arguments(struct task_struct *task, struct pt_regs *regs, *args = regs->er6; } +static __maybe_unused const int syscall_arches[] = { + AUDIT_ARCH_H8300 +}; + static inline int syscall_get_arch(struct task_struct *task) { diff --git a/arch/hexagon/include/asm/syscall.h b/arch/hexagon/include/asm/syscall.h index f6e454f18038..6ee21a76f6a3 100644 --- a/arch/hexagon/include/asm/syscall.h +++ b/arch/hexagon/include/asm/syscall.h @@ -45,6 +45,10 @@ static inline long syscall_get_return_value(struct task_struct *task, return regs->r00; } +static __maybe_unused const int syscall_arches[] = { + 
AUDIT_ARCH_HEXAGON +}; + static inline int syscall_get_arch(struct task_struct *task) { return AUDIT_ARCH_HEXAGON; diff --git a/arch/ia64/include/asm/syscall.h b/arch/ia64/include/asm/syscall.h index 6c6f16e409a8..19456125c89a 100644 --- a/arch/ia64/include/asm/syscall.h +++ b/arch/ia64/include/asm/syscall.h @@ -71,6 +71,10 @@ static inline void syscall_set_arguments(struct task_struct *task, ia64_syscall_get_set_arguments(task, regs, args, 1); } +static __maybe_unused const int syscall_arches[] = { + AUDIT_ARCH_IA64 +}; + static inline int syscall_get_arch(struct task_struct *task) { return AUDIT_ARCH_IA64; diff --git a/arch/m68k/include/asm/syscall.h b/arch/m68k/include/asm/syscall.h index 465ac039be09..031b051f9026 100644 --- a/arch/m68k/include/asm/syscall.h +++ b/arch/m68k/include/asm/syscall.h @@ -4,6 +4,10 @@ #include <uapi/linux/audit.h> +static __maybe_unused const int syscall_arches[] = { + AUDIT_ARCH_M68K +}; + static inline int syscall_get_arch(struct task_struct *task) { return AUDIT_ARCH_M68K; diff --git a/arch/microblaze/include/asm/syscall.h b/arch/microblaze/include/asm/syscall.h index 3a6924f3cbde..28cde14056d1 100644 --- a/arch/microblaze/include/asm/syscall.h +++ b/arch/microblaze/include/asm/syscall.h @@ -105,6 +105,10 @@ static inline void syscall_set_arguments(struct task_struct *task, asmlinkage unsigned long do_syscall_trace_enter(struct pt_regs *regs); asmlinkage void do_syscall_trace_leave(struct pt_regs *regs); +static __maybe_unused const int syscall_arches[] = { + AUDIT_ARCH_MICROBLAZE +}; + static inline int syscall_get_arch(struct task_struct *task) { return AUDIT_ARCH_MICROBLAZE; diff --git a/arch/mips/include/asm/syscall.h b/arch/mips/include/asm/syscall.h index 25fa651c937d..29e4c1c47c54 100644 --- a/arch/mips/include/asm/syscall.h +++ b/arch/mips/include/asm/syscall.h @@ -140,6 +140,22 @@ extern const unsigned long sys_call_table[]; extern const unsigned long sys32_call_table[]; extern const unsigned long sysn32_call_table[]; 
+static __maybe_unused const int syscall_arches[] = { +#ifdef __LITTLE_ENDIAN + AUDIT_ARCH_MIPSEL, +# ifdef CONFIG_64BIT + AUDIT_ARCH_MIPSEL64, + AUDIT_ARCH_MIPSEL64N32, +# endif /* CONFIG_64BIT */ +#else + AUDIT_ARCH_MIPS, +# ifdef CONFIG_64BIT + AUDIT_ARCH_MIPS64, + AUDIT_ARCH_MIPS64N32, +# endif /* CONFIG_64BIT */ +#endif /* __LITTLE_ENDIAN */ +}; + static inline int syscall_get_arch(struct task_struct *task) { int arch = AUDIT_ARCH_MIPS; diff --git a/arch/nds32/include/asm/syscall.h b/arch/nds32/include/asm/syscall.h index 7b5180d78e20..2dd5e33bcfcb 100644 --- a/arch/nds32/include/asm/syscall.h +++ b/arch/nds32/include/asm/syscall.h @@ -154,11 +154,20 @@ syscall_set_arguments(struct task_struct *task, struct pt_regs *regs, memcpy(®s->uregs[0] + 1, args, 5 * sizeof(args[0])); } +#ifdef CONFIG_CPU_BIG_ENDIAN +#define SYSCALL_ARCH AUDIT_ARCH_NDS32BE +#else +#define SYSCALL_ARCH AUDIT_ARCH_NDS32 +#endif + +static __maybe_unused const int syscall_arches[] = { + SYSCALL_ARCH +}; + static inline int syscall_get_arch(struct task_struct *task) { - return IS_ENABLED(CONFIG_CPU_BIG_ENDIAN) - ? 
AUDIT_ARCH_NDS32BE : AUDIT_ARCH_NDS32; + return SYSCALL_ARCH; } #endif /* _ASM_NDS32_SYSCALL_H */ diff --git a/arch/nios2/include/asm/syscall.h b/arch/nios2/include/asm/syscall.h index 526449edd768..8fa2716cac5a 100644 --- a/arch/nios2/include/asm/syscall.h +++ b/arch/nios2/include/asm/syscall.h @@ -69,6 +69,10 @@ static inline void syscall_set_arguments(struct task_struct *task, regs->r9 = *args; } +static __maybe_unused const int syscall_arches[] = { + AUDIT_ARCH_NIOS2 +}; + static inline int syscall_get_arch(struct task_struct *task) { return AUDIT_ARCH_NIOS2; diff --git a/arch/openrisc/include/asm/syscall.h b/arch/openrisc/include/asm/syscall.h index e6383be2a195..4eb28ad08042 100644 --- a/arch/openrisc/include/asm/syscall.h +++ b/arch/openrisc/include/asm/syscall.h @@ -64,6 +64,10 @@ syscall_set_arguments(struct task_struct *task, struct pt_regs *regs, memcpy(®s->gpr[3], args, 6 * sizeof(args[0])); } +static __maybe_unused const int syscall_arches[] = { + AUDIT_ARCH_OPENRISC +}; + static inline int syscall_get_arch(struct task_struct *task) { return AUDIT_ARCH_OPENRISC; diff --git a/arch/parisc/include/asm/syscall.h b/arch/parisc/include/asm/syscall.h index 00b127a5e09b..2915f140c9fd 100644 --- a/arch/parisc/include/asm/syscall.h +++ b/arch/parisc/include/asm/syscall.h @@ -55,6 +55,13 @@ static inline void syscall_rollback(struct task_struct *task, /* do nothing */ } +static __maybe_unused const int syscall_arches[] = { + AUDIT_ARCH_PARISC, +#ifdef CONFIG_64BIT + AUDIT_ARCH_PARISC64, +#endif +}; + static inline int syscall_get_arch(struct task_struct *task) { int arch = AUDIT_ARCH_PARISC; diff --git a/arch/powerpc/include/asm/syscall.h b/arch/powerpc/include/asm/syscall.h index fd1b518eed17..781deb211e3d 100644 --- a/arch/powerpc/include/asm/syscall.h +++ b/arch/powerpc/include/asm/syscall.h @@ -104,6 +104,20 @@ static inline void syscall_set_arguments(struct task_struct *task, regs->orig_gpr3 = args[0]; } +static __maybe_unused const int syscall_arches[] = { 
+#ifdef __LITTLE_ENDIAN__ + AUDIT_ARCH_PPC | __AUDIT_ARCH_LE, +# ifdef CONFIG_PPC64 + AUDIT_ARCH_PPC64LE, +# endif /* CONFIG_PPC64 */ +#else + AUDIT_ARCH_PPC, +# ifdef CONFIG_PPC64 + AUDIT_ARCH_PPC64, +# endif /* CONFIG_PPC64 */ +#endif /* __LITTLE_ENDIAN__ */ +}; + static inline int syscall_get_arch(struct task_struct *task) { int arch; diff --git a/arch/riscv/include/asm/syscall.h b/arch/riscv/include/asm/syscall.h index 49350c8bd7b0..4b36d358243e 100644 --- a/arch/riscv/include/asm/syscall.h +++ b/arch/riscv/include/asm/syscall.h @@ -73,13 +73,19 @@ static inline void syscall_set_arguments(struct task_struct *task, memcpy(®s->a1, args, 5 * sizeof(regs->a1)); } -static inline int syscall_get_arch(struct task_struct *task) -{ #ifdef CONFIG_64BIT - return AUDIT_ARCH_RISCV64; +#define SYSCALL_ARCH AUDIT_ARCH_RISCV64 #else - return AUDIT_ARCH_RISCV32; +#define SYSCALL_ARCH AUDIT_ARCH_RISCV32 #endif + +static __maybe_unused const int syscall_arches[] = { + SYSCALL_ARCH +}; + +static inline int syscall_get_arch(struct task_struct *task) +{ + return SYSCALL_ARCH; } #endif /* _ASM_RISCV_SYSCALL_H */ diff --git a/arch/s390/include/asm/syscall.h b/arch/s390/include/asm/syscall.h index d9d5de0f67ff..4cb9da36610a 100644 --- a/arch/s390/include/asm/syscall.h +++ b/arch/s390/include/asm/syscall.h @@ -89,6 +89,13 @@ static inline void syscall_set_arguments(struct task_struct *task, regs->orig_gpr2 = args[0]; } +static __maybe_unused const int syscall_arches[] = { + AUDIT_ARCH_S390X, +#ifdef CONFIG_COMPAT + AUDIT_ARCH_S390, +#endif +}; + static inline int syscall_get_arch(struct task_struct *task) { #ifdef CONFIG_COMPAT diff --git a/arch/sh/include/asm/syscall_32.h b/arch/sh/include/asm/syscall_32.h index cb51a7528384..4780f2339c72 100644 --- a/arch/sh/include/asm/syscall_32.h +++ b/arch/sh/include/asm/syscall_32.h @@ -69,13 +69,18 @@ static inline void syscall_set_arguments(struct task_struct *task, regs->regs[4] = args[0]; } -static inline int syscall_get_arch(struct 
task_struct *task) -{ - int arch = AUDIT_ARCH_SH; - #ifdef CONFIG_CPU_LITTLE_ENDIAN - arch |= __AUDIT_ARCH_LE; +#define SYSCALL_ARCH AUDIT_ARCH_SHEL +#else +#define SYSCALL_ARCH AUDIT_ARCH_SH #endif - return arch; + +static __maybe_unused const int syscall_arches[] = { + SYSCALL_ARCH +}; + +static inline int syscall_get_arch(struct task_struct *task) +{ + return SYSCALL_ARCH; } #endif /* __ASM_SH_SYSCALL_32_H */ diff --git a/arch/sparc/include/asm/syscall.h b/arch/sparc/include/asm/syscall.h index 62a5a78804c4..a458992cdcfe 100644 --- a/arch/sparc/include/asm/syscall.h +++ b/arch/sparc/include/asm/syscall.h @@ -127,6 +127,15 @@ static inline void syscall_set_arguments(struct task_struct *task, regs->u_regs[UREG_I0 + i] = args[i]; } +static __maybe_unused const int syscall_arches[] = { +#ifdef CONFIG_SPARC64 + AUDIT_ARCH_SPARC64, +#endif +#if !defined(CONFIG_SPARC64) || defined(CONFIG_COMPAT) + AUDIT_ARCH_SPARC, +#endif +}; + static inline int syscall_get_arch(struct task_struct *task) { #if defined(CONFIG_SPARC64) && defined(CONFIG_COMPAT) diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h index 7cbf733d11af..e13bb2a65b6f 100644 --- a/arch/x86/include/asm/syscall.h +++ b/arch/x86/include/asm/syscall.h @@ -97,6 +97,10 @@ static inline void syscall_set_arguments(struct task_struct *task, memcpy(®s->bx + i, args, n * sizeof(args[0])); } +static __maybe_unused const int syscall_arches[] = { + AUDIT_ARCH_I386 +}; + static inline int syscall_get_arch(struct task_struct *task) { return AUDIT_ARCH_I386; @@ -152,6 +156,13 @@ static inline void syscall_set_arguments(struct task_struct *task, } } +static __maybe_unused const int syscall_arches[] = { + AUDIT_ARCH_X86_64, +#ifdef CONFIG_IA32_EMULATION + AUDIT_ARCH_I386, +#endif +}; + static inline int syscall_get_arch(struct task_struct *task) { /* x32 tasks should be considered AUDIT_ARCH_X86_64. 
*/ diff --git a/arch/x86/um/asm/syscall.h b/arch/x86/um/asm/syscall.h index 56a2f0913e3c..590a31e22b99 100644 --- a/arch/x86/um/asm/syscall.h +++ b/arch/x86/um/asm/syscall.h @@ -9,13 +9,19 @@ typedef asmlinkage long (*sys_call_ptr_t)(unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long); -static inline int syscall_get_arch(struct task_struct *task) -{ #ifdef CONFIG_X86_32 - return AUDIT_ARCH_I386; +#define SYSCALL_ARCH AUDIT_ARCH_I386 #else - return AUDIT_ARCH_X86_64; +#define SYSCALL_ARCH AUDIT_ARCH_X86_64 #endif + +static __maybe_unused const int syscall_arches[] = { + SYSCALL_ARCH +}; + +static inline int syscall_get_arch(struct task_struct *task) +{ + return SYSCALL_ARCH; } #endif /* __UM_ASM_SYSCALL_H */ diff --git a/arch/xtensa/include/asm/syscall.h b/arch/xtensa/include/asm/syscall.h index f9a671cbf933..3d334fb0d329 100644 --- a/arch/xtensa/include/asm/syscall.h +++ b/arch/xtensa/include/asm/syscall.h @@ -14,6 +14,10 @@ #include <asm/ptrace.h> #include <uapi/linux/audit.h> +static __maybe_unused const int syscall_arches[] = { + AUDIT_ARCH_XTENSA +}; + static inline int syscall_get_arch(struct task_struct *task) { return AUDIT_ARCH_XTENSA; -- 2.28.0 ^ permalink raw reply related [flat|nested] 149+ messages in thread
* RE: [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array 2020-09-24 12:44 ` [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array YiFei Zhu @ 2020-09-24 13:47 ` David Laight 2020-09-24 14:16 ` YiFei Zhu 0 siblings, 1 reply; 149+ messages in thread From: David Laight @ 2020-09-24 13:47 UTC (permalink / raw) To: 'YiFei Zhu', containers Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: YiFei Zhu > Sent: 24 September 2020 13:44 > > Seccomp cache emulator needs to know all the architecture numbers > that syscall_get_arch() could return for the kernel build in order > to generate a cache for all of them. > > The array is declared in header as static __maybe_unused const > to maximize compiler optimiation opportunities such as loop > unrolling. I doubt the compiler will do what you want. Looking at it, in most cases there are one or two entries. I think only MIPS has three. So a static inline function that contains a list of conditionals will generate better code than any kind of array lookup. For x86-64 you end up with something like: #ifdef CONFIG_IA32_EMULATION if (sd->arch == AUDIT_ARCH_I386) return xxx; #endif return yyy; Probably saves you having multiple arrays that need to be kept carefully in step. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales) ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array 2020-09-24 13:47 ` David Laight @ 2020-09-24 14:16 ` YiFei Zhu 2020-09-24 14:20 ` David Laight 0 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-09-24 14:16 UTC (permalink / raw) To: David Laight Cc: containers, YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Thu, Sep 24, 2020 at 8:47 AM David Laight <David.Laight@aculab.com> wrote: > I doubt the compiler will do what you want. > Looking at it, in most cases there are one or two entries. > I think only MIPS has three. It does ;) GCC 10.2.0: $ objdump -d kernel/seccomp.o | less [...] 0000000000001520 <__seccomp_filter>: [...] 1587: 41 8b 54 24 04 mov 0x4(%r12),%edx 158c: b9 08 01 00 00 mov $0x108,%ecx 1591: 81 fa 3e 00 00 c0 cmp $0xc000003e,%edx 1597: 75 2e jne 15c7 <__seccomp_filter+0xa7> [...] 15c7: 81 fa 03 00 00 40 cmp $0x40000003,%edx 15cd: b9 40 01 00 00 mov $0x140,%ecx 15d2: 74 c5 je 1599 <__seccomp_filter+0x79> 15d4: 0f 0b ud2 [...] 0000000000001cb0 <seccomp_cache_prepare>: [...] 1cc4: 41 b9 3e 00 00 c0 mov $0xc000003e,%r9d [...] 1dba: 41 b9 03 00 00 40 mov $0x40000003,%r9d [...] 0000000000002e30 <proc_pid_seccomp_cache>: [...] 2e72: ba 3e 00 00 c0 mov $0xc000003e,%edx [...] 2eb5: ba 03 00 00 40 mov $0x40000003,%edx Granted, I have CC_OPTIMIZE_FOR_PERFORMANCE rather than CC_OPTIMIZE_FOR_SIZE, but this patch itself is trying to sacrifice some of the memory for speed. YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
* RE: [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array 2020-09-24 14:16 ` YiFei Zhu @ 2020-09-24 14:20 ` David Laight 2020-09-24 14:37 ` YiFei Zhu 0 siblings, 1 reply; 149+ messages in thread From: David Laight @ 2020-09-24 14:20 UTC (permalink / raw) To: 'YiFei Zhu' Cc: containers, YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: YiFei Zhu > Sent: 24 September 2020 15:17 > > On Thu, Sep 24, 2020 at 8:47 AM David Laight <David.Laight@aculab.com> wrote: > > I doubt the compiler will do what you want. > > Looking at it, in most cases there are one or two entries. > > I think only MIPS has three. > > It does ;) GCC 10.2.0: > > $ objdump -d kernel/seccomp.o | less > [...] > 0000000000001520 <__seccomp_filter>: > [...] > 1587: 41 8b 54 24 04 mov 0x4(%r12),%edx > 158c: b9 08 01 00 00 mov $0x108,%ecx > 1591: 81 fa 3e 00 00 c0 cmp $0xc000003e,%edx > 1597: 75 2e jne 15c7 <__seccomp_filter+0xa7> > [...] > 15c7: 81 fa 03 00 00 40 cmp $0x40000003,%edx > 15cd: b9 40 01 00 00 mov $0x140,%ecx > 15d2: 74 c5 je 1599 <__seccomp_filter+0x79> > 15d4: 0f 0b ud2 > [...] > 0000000000001cb0 <seccomp_cache_prepare>: > [...] > 1cc4: 41 b9 3e 00 00 c0 mov $0xc000003e,%r9d > [...] > 1dba: 41 b9 03 00 00 40 mov $0x40000003,%r9d > [...] > 0000000000002e30 <proc_pid_seccomp_cache>: > [...] > 2e72: ba 3e 00 00 c0 mov $0xc000003e,%edx > [...] > 2eb5: ba 03 00 00 40 mov $0x40000003,%edx > > Granted, I have CC_OPTIMIZE_FOR_PERFORMANCE rather than > CC_OPTIMIZE_FOR_SIZE, but this patch itself is trying to sacrifice > some of the memory for speed. Don't both CC_OPTIMIZE_FOR_PERFORMANCE (-??) and CC_OPTIMIZE_FOR_SIZE (-s) generate terrible code? Try with a slightly older gcc. I think that entire optimisation (discarding const arrays) is very recent. 
David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales) ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array 2020-09-24 14:20 ` David Laight @ 2020-09-24 14:37 ` YiFei Zhu 2020-09-24 16:02 ` YiFei Zhu 0 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-09-24 14:37 UTC (permalink / raw) To: David Laight Cc: containers, YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Thu, Sep 24, 2020 at 9:20 AM David Laight <David.Laight@aculab.com> wrote: > > Granted, I have CC_OPTIMIZE_FOR_PERFORMANCE rather than > > CC_OPTIMIZE_FOR_SIZE, but this patch itself is trying to sacrifice > > some of the memory for speed. > > Don't both CC_OPTIMIZE_FOR_PERFORMANCE (-??) and CC_OPTIMIZE_FOR_SIZE (-s) > generate terrible code? You have to choose one for "Compiler optimization level" in "General Setup", no? The former is -O2 and the latter is -Os. > Try with a slghtly older gcc. > I think that entire optimisation (discarding const arrays) > is very recent. Will try, will take a while to get an old GCC to run, however :/ YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array 2020-09-24 14:37 ` YiFei Zhu @ 2020-09-24 16:02 ` YiFei Zhu 0 siblings, 0 replies; 149+ messages in thread From: YiFei Zhu @ 2020-09-24 16:02 UTC (permalink / raw) To: David Laight Cc: containers, YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Thu, Sep 24, 2020 at 9:37 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > > Try with a slghtly older gcc. > > I think that entire optimisation (discarding const arrays) > > is very recent. > > Will try, will take a while to get an old GCC to run, however :/ Possibly one of the oldest I can easily get to work is GCC 6.5.0, and unrolling seems to still be the case: 0000000000001560 <__seccomp_filter>: [...] 15d4: 41 8b 74 24 04 mov 0x4(%r12),%esi 15d9: bf 08 01 00 00 mov $0x108,%edi 15de: 81 fe 3e 00 00 c0 cmp $0xc000003e,%esi 15e4: 75 30 jne 1616 <__seccomp_filter+0xb6> [...] 1616: 81 fe 03 00 00 40 cmp $0x40000003,%esi 161c: bf 40 01 00 00 mov $0x140,%edi 1621: 74 c3 je 15e6 <__seccomp_filter+0x86> 1623: 0f 0b ud2 Am I overlooking something or should I go further back in the compiler version? YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
* [PATCH v2 seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent 2020-09-24 12:44 ` [PATCH v2 seccomp 0/6] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls YiFei Zhu 2020-09-24 12:44 ` [PATCH v2 seccomp 1/6] seccomp: Move config option SECCOMP to arch/Kconfig YiFei Zhu 2020-09-24 12:44 ` [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array YiFei Zhu @ 2020-09-24 12:44 ` YiFei Zhu 2020-09-24 23:25 ` Kees Cook 2020-09-24 12:44 ` [PATCH v2 seccomp 4/6] seccomp/cache: Lookup syscall allowlist for fast path YiFei Zhu ` (2 subsequent siblings) 5 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-09-24 12:44 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: YiFei Zhu <yifeifz2@illinois.edu> SECCOMP_CACHE_NR_ONLY will only operate on syscalls that do not access any syscall arguments or instruction pointer. To facilitate this we need a static analyser to know whether a filter will return allow regardless of syscall arguments for a given architecture number / syscall number pair. This is implemented here with a pseudo-emulator, and stored in a per-filter bitmap. Each common BPF instruction (stolen from Kees's list [1]) is emulated. Any weirdness or loading from a syscall argument will cause the emulator to bail. The emulation is also halted if it reaches a return. In that case, if it returns SECCOMP_RET_ALLOW, the syscall is marked as good. Filter dependency is resolved at attach time. If a filter depends on more filters, then we perform an AND on its bitmask against its dependee's; if the dependee does not guarantee to allow the syscall, then the depender is also marked not to guarantee to allow the syscall. 
[1] https://lore.kernel.org/lkml/20200923232923.3142503-5-keescook@chromium.org/ Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> --- arch/Kconfig | 25 ++++++ kernel/seccomp.c | 194 ++++++++++++++++++++++++++++++++++++++++++++++- 2 files changed, 218 insertions(+), 1 deletion(-) diff --git a/arch/Kconfig b/arch/Kconfig index 6dfc5673215d..8cc3dc87f253 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -489,6 +489,31 @@ config SECCOMP_FILTER See Documentation/userspace-api/seccomp_filter.rst for details. +choice + prompt "Seccomp filter cache" + default SECCOMP_CACHE_NONE + depends on SECCOMP_FILTER + help + Seccomp filters can potentially incur large overhead for each + system call. This can alleviate some of the overhead. + + If in doubt, select 'syscall numbers only'. + +config SECCOMP_CACHE_NONE + bool "None" + help + No caching is done. Seccomp filters will be called each time + a system call occurs in a seccomp-guarded task. + +config SECCOMP_CACHE_NR_ONLY + bool "Syscall number only" + depends on !HAVE_SPARSE_SYSCALL_NR + help + For each syscall number, if the seccomp filter has a fixed + result, store that result in a bitmap to speed up system calls. + +endchoice + config HAVE_ARCH_STACKLEAK bool help diff --git a/kernel/seccomp.c b/kernel/seccomp.c index 3ee59ce0a323..20d33378a092 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -143,6 +143,32 @@ struct notification { struct list_head notifications; }; +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY +/** + * struct seccomp_cache_filter_data - container for cache's per-filter data + * + * @syscall_ok: A bitmap for each architecture number, where each bit + * represents whether the filter will always allow the syscall. 
+ */ +struct seccomp_cache_filter_data { + DECLARE_BITMAP(syscall_ok[ARRAY_SIZE(syscall_arches)], NR_syscalls); +}; + +#define SECCOMP_EMU_MAX_PENDING_STATES 64 +#else +struct seccomp_cache_filter_data { }; + +static inline int seccomp_cache_prepare(struct seccomp_filter *sfilter) +{ + return 0; +} + +static inline void seccomp_cache_inherit(struct seccomp_filter *sfilter, + const struct seccomp_filter *prev) +{ +} +#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */ + /** * struct seccomp_filter - container for seccomp BPF programs * @@ -185,6 +211,7 @@ struct seccomp_filter { struct notification *notif; struct mutex notify_lock; wait_queue_head_t wqh; + struct seccomp_cache_filter_data cache; }; /* Limit any path through the tree to 256KB worth of instructions. */ @@ -530,6 +557,139 @@ static inline void seccomp_sync_threads(unsigned long flags) } } +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY +/** + * struct seccomp_emu_env - container for seccomp emulator environment + * + * @filter: The cBPF filter instructions. + * @nr: The syscall number we are emulating. + * @arch: The architecture number we are emulating. + * @syscall_ok: Emulation result, whether it is okay for seccomp to cache the + * syscall. + */ +struct seccomp_emu_env { + struct sock_filter *filter; + int arch; + int nr; + bool syscall_ok; +}; + +/** + * struct seccomp_emu_state - container for seccomp emulator state + * + * @next: The next pending state. This structure is a linked list. + * @pc: The current program counter. + * @areg: the value of that A register. + */ +struct seccomp_emu_state { + struct seccomp_emu_state *next; + int pc; + u32 areg; +}; + +/** + * seccomp_emu_step - step one instruction in the emulator + * @env: The emulator environment + * @state: The emulator state + * + * Returns 1 to halt emulation, 0 to continue, or -errno if error occurred. 
+ */ +static int seccomp_emu_step(struct seccomp_emu_env *env, + struct seccomp_emu_state *state) +{ + struct sock_filter *ftest = &env->filter[state->pc++]; + u16 code = ftest->code; + u32 k = ftest->k; + bool compare; + + switch (code) { + case BPF_LD | BPF_W | BPF_ABS: + if (k == offsetof(struct seccomp_data, nr)) + state->areg = env->nr; + else if (k == offsetof(struct seccomp_data, arch)) + state->areg = env->arch; + else + return 1; + + return 0; + case BPF_JMP | BPF_JA: + state->pc += k; + return 0; + case BPF_JMP | BPF_JEQ | BPF_K: + case BPF_JMP | BPF_JGE | BPF_K: + case BPF_JMP | BPF_JGT | BPF_K: + case BPF_JMP | BPF_JSET | BPF_K: + switch (BPF_OP(code)) { + case BPF_JEQ: + compare = state->areg == k; + break; + case BPF_JGT: + compare = state->areg > k; + break; + case BPF_JGE: + compare = state->areg >= k; + break; + case BPF_JSET: + compare = state->areg & k; + break; + default: + WARN_ON(true); + return -EINVAL; + } + + state->pc += compare ? ftest->jt : ftest->jf; + return 0; + case BPF_ALU | BPF_AND | BPF_K: + state->areg &= k; + return 0; + case BPF_RET | BPF_K: + env->syscall_ok = k == SECCOMP_RET_ALLOW; + return 1; + default: + return 1; + } +} + +/** + * seccomp_cache_prepare - emulate the filter to find cachable syscalls + * @sfilter: The seccomp filter + * + * Returns 0 if successful or -errno if error occurred. 
+ */ +int seccomp_cache_prepare(struct seccomp_filter *sfilter) +{ + struct sock_fprog_kern *fprog = sfilter->prog->orig_prog; + struct sock_filter *filter = fprog->filter; + int arch, nr, res = 0; + + for (arch = 0; arch < ARRAY_SIZE(syscall_arches); arch++) { + for (nr = 0; nr < NR_syscalls; nr++) { + struct seccomp_emu_env env = {0}; + struct seccomp_emu_state state = {0}; + + env.filter = filter; + env.arch = syscall_arches[arch]; + env.nr = nr; + + while (true) { + res = seccomp_emu_step(&env, &state); + if (res) + break; + } + + if (res < 0) + goto out; + + if (env.syscall_ok) + set_bit(nr, sfilter->cache.syscall_ok[arch]); + } + } + +out: + return res; +} +#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */ + /** * seccomp_prepare_filter: Prepares a seccomp filter for use. * @fprog: BPF program to install @@ -540,7 +700,8 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog) { struct seccomp_filter *sfilter; int ret; - const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE); + const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE) || + IS_ENABLED(CONFIG_SECCOMP_CACHE_NR_ONLY); if (fprog->len == 0 || fprog->len > BPF_MAXINSNS) return ERR_PTR(-EINVAL); @@ -571,6 +732,13 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog) return ERR_PTR(ret); } + ret = seccomp_cache_prepare(sfilter); + if (ret < 0) { + bpf_prog_destroy(sfilter->prog); + kfree(sfilter); + return ERR_PTR(ret); + } + refcount_set(&sfilter->refs, 1); refcount_set(&sfilter->users, 1); init_waitqueue_head(&sfilter->wqh); @@ -606,6 +774,29 @@ seccomp_prepare_user_filter(const char __user *user_filter) return filter; } +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY +/** + * seccomp_cache_inherit - mask accept bitmap against previous filter + * @sfilter: The seccomp filter + * @sfilter: The previous seccomp filter + */ +static void seccomp_cache_inherit(struct seccomp_filter *sfilter, + const struct seccomp_filter *prev) +{ + int arch; + + if (!prev) + 
+		return;
+
+	for (arch = 0; arch < ARRAY_SIZE(syscall_arches); arch++) {
+		bitmap_and(sfilter->cache.syscall_ok[arch],
+			   sfilter->cache.syscall_ok[arch],
+			   prev->cache.syscall_ok[arch],
+			   NR_syscalls);
+	}
+}
+#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */
+
 /**
  * seccomp_attach_filter: validate and attach filter
  * @flags: flags to change filter behavior
@@ -655,6 +846,7 @@ static long seccomp_attach_filter(unsigned int flags,
 	 * task reference.
 	 */
 	filter->prev = current->seccomp.filter;
+	seccomp_cache_inherit(filter, filter->prev);
 	current->seccomp.filter = filter;
 	atomic_inc(&current->seccomp.filter_count);

-- 
2.28.0
* Re: [PATCH v2 seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent 2020-09-24 12:44 ` [PATCH v2 seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent YiFei Zhu @ 2020-09-24 23:25 ` Kees Cook 2020-09-25 3:04 ` YiFei Zhu 0 siblings, 1 reply; 149+ messages in thread From: Kees Cook @ 2020-09-24 23:25 UTC (permalink / raw) To: YiFei Zhu Cc: containers, YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Thu, Sep 24, 2020 at 07:44:18AM -0500, YiFei Zhu wrote: > From: YiFei Zhu <yifeifz2@illinois.edu> > > SECCOMP_CACHE_NR_ONLY will only operate on syscalls that do not > access any syscall arguments or instruction pointer. To facilitate > this we need a static analyser to know whether a filter will > return allow regardless of syscall arguments for a given > architecture number / syscall number pair. This is implemented > here with a pseudo-emulator, and stored in a per-filter bitmap. > > Each common BPF instruction (stolen from Kees's list [1]) are > emulated. Any weirdness or loading from a syscall argument will > cause the emulator to bail. > > The emulation is also halted if it reaches a return. In that case, > if it returns an SECCOMP_RET_ALLOW, the syscall is marked as good. > > Filter dependency is resolved at attach time. If a filter depends > on more filters, then we perform an and on its bitmask against its > dependee; if the dependee does not guarantee to allow the syscall, > then the depender is also marked not to guarantee to allow the > syscall. 
> > [1] https://lore.kernel.org/lkml/20200923232923.3142503-5-keescook@chromium.org/ > > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> > --- > arch/Kconfig | 25 ++++++ > kernel/seccomp.c | 194 ++++++++++++++++++++++++++++++++++++++++++++++- > 2 files changed, 218 insertions(+), 1 deletion(-) > > diff --git a/arch/Kconfig b/arch/Kconfig > index 6dfc5673215d..8cc3dc87f253 100644 > --- a/arch/Kconfig > +++ b/arch/Kconfig > @@ -489,6 +489,31 @@ config SECCOMP_FILTER > > See Documentation/userspace-api/seccomp_filter.rst for details. > > +choice > + prompt "Seccomp filter cache" > + default SECCOMP_CACHE_NONE > + depends on SECCOMP_FILTER > + help > + Seccomp filters can potentially incur large overhead for each > + system call. This can alleviate some of the overhead. > + > + If in doubt, select 'syscall numbers only'. > + > +config SECCOMP_CACHE_NONE > + bool "None" > + help > + No caching is done. Seccomp filters will be called each time > + a system call occurs in a seccomp-guarded task. > + > +config SECCOMP_CACHE_NR_ONLY > + bool "Syscall number only" > + depends on !HAVE_SPARSE_SYSCALL_NR > + help > + For each syscall number, if the seccomp filter has a fixed > + result, store that result in a bitmap to speed up system calls. > + > +endchoice I'm not interested in seccomp having a config option for this. It should entire exist or not, and that depends on the per-architecture support. You mentioned in another thread that you wanted it to let people play with this support in some way. Can you elaborate on this? My perspective is that of distro and vendor kernels: there is _one_ config and end users can't really do anything about it without rolling their own kernels. > +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY > +/** > + * struct seccomp_cache_filter_data - container for cache's per-filter data > + * > + * @syscall_ok: A bitmap for each architecture number, where each bit > + * represents whether the filter will always allow the syscall. 
> + */ > +struct seccomp_cache_filter_data { > + DECLARE_BITMAP(syscall_ok[ARRAY_SIZE(syscall_arches)], NR_syscalls); > +}; So, as Jann pointed out, using NR_syscalls only accidentally works -- they're actually different sizes and there isn't strictly any reason to expect one to be smaller than another. So, we need to either choose the max() in asm/linux/seccomp.h or be more efficient with space usage and use explicitly named bitmaps (how my v1 does things). > + > +#define SECCOMP_EMU_MAX_PENDING_STATES 64 This isn't used in this patch; likely leftover/in need of moving? > +#else > +struct seccomp_cache_filter_data { }; > + > +static inline int seccomp_cache_prepare(struct seccomp_filter *sfilter) > +{ > + return 0; > +} > + > +static inline void seccomp_cache_inherit(struct seccomp_filter *sfilter, > + const struct seccomp_filter *prev) > +{ > +} > +#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */ > + > /** > * struct seccomp_filter - container for seccomp BPF programs > * > @@ -185,6 +211,7 @@ struct seccomp_filter { > struct notification *notif; > struct mutex notify_lock; > wait_queue_head_t wqh; > + struct seccomp_cache_filter_data cache; I moved this up in the structure to see if I could benefit from cache line sharing. In either case, we must verify (with "pahole") that we do not induce massive padding in the struct. But yes, attaching this to the filter is the right way to go. > }; > > /* Limit any path through the tree to 256KB worth of instructions. */ > @@ -530,6 +557,139 @@ static inline void seccomp_sync_threads(unsigned long flags) > } > } > > +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY > +/** > + * struct seccomp_emu_env - container for seccomp emulator environment > + * > + * @filter: The cBPF filter instructions. > + * @nr: The syscall number we are emulating. > + * @arch: The architecture number we are emulating. > + * @syscall_ok: Emulation result, whether it is okay for seccomp to cache the > + * syscall. 
> + */ > +struct seccomp_emu_env { > + struct sock_filter *filter; > + int arch; > + int nr; > + bool syscall_ok; nit: "ok" is too vague. We mean either "constant action" or "allow" (or "filter" in the negative case). > +}; > + > +/** > + * struct seccomp_emu_state - container for seccomp emulator state > + * > + * @next: The next pending state. This structure is a linked list. > + * @pc: The current program counter. > + * @areg: the value of that A register. > + */ > +struct seccomp_emu_state { > + struct seccomp_emu_state *next; > + int pc; > + u32 areg; > +}; Why is this split out? (i.e. why is it not just a self-contained loop the way Jann wrote it?) > + > +/** > + * seccomp_emu_step - step one instruction in the emulator > + * @env: The emulator environment > + * @state: The emulator state > + * > + * Returns 1 to halt emulation, 0 to continue, or -errno if error occurred. I appreciate the -errno intent, but it actually risks making these changes break existing userspace filters: if something is unhandled in the emulator in a way we don't find during design and testing, the filter load will actually _fail_ instead of just falling back to "run filter". Failures should be reported (WARN_ON_ONCE()), but my v1 intentionally lets this continue. 
> + */ > +static int seccomp_emu_step(struct seccomp_emu_env *env, > + struct seccomp_emu_state *state) > +{ > + struct sock_filter *ftest = &env->filter[state->pc++]; > + u16 code = ftest->code; > + u32 k = ftest->k; > + bool compare; > + > + switch (code) { > + case BPF_LD | BPF_W | BPF_ABS: > + if (k == offsetof(struct seccomp_data, nr)) > + state->areg = env->nr; > + else if (k == offsetof(struct seccomp_data, arch)) > + state->areg = env->arch; > + else > + return 1; > + > + return 0; > + case BPF_JMP | BPF_JA: > + state->pc += k; > + return 0; > + case BPF_JMP | BPF_JEQ | BPF_K: > + case BPF_JMP | BPF_JGE | BPF_K: > + case BPF_JMP | BPF_JGT | BPF_K: > + case BPF_JMP | BPF_JSET | BPF_K: > + switch (BPF_OP(code)) { > + case BPF_JEQ: > + compare = state->areg == k; > + break; > + case BPF_JGT: > + compare = state->areg > k; > + break; > + case BPF_JGE: > + compare = state->areg >= k; > + break; > + case BPF_JSET: > + compare = state->areg & k; > + break; > + default: > + WARN_ON(true); > + return -EINVAL; > + } > + > + state->pc += compare ? ftest->jt : ftest->jf; > + return 0; > + case BPF_ALU | BPF_AND | BPF_K: > + state->areg &= k; > + return 0; > + case BPF_RET | BPF_K: > + env->syscall_ok = k == SECCOMP_RET_ALLOW; > + return 1; > + default: > + return 1; > + } > +} This version appears to have removed all the comments; I liked Jann's comments and I had rearranged things a bit to make it more readable (IMO) for people that do not immediate understand BPF. :) > + > +/** > + * seccomp_cache_prepare - emulate the filter to find cachable syscalls > + * @sfilter: The seccomp filter > + * > + * Returns 0 if successful or -errno if error occurred. 
> + */ > +int seccomp_cache_prepare(struct seccomp_filter *sfilter) > +{ > + struct sock_fprog_kern *fprog = sfilter->prog->orig_prog; > + struct sock_filter *filter = fprog->filter; > + int arch, nr, res = 0; > + > + for (arch = 0; arch < ARRAY_SIZE(syscall_arches); arch++) { > + for (nr = 0; nr < NR_syscalls; nr++) { > + struct seccomp_emu_env env = {0}; > + struct seccomp_emu_state state = {0}; > + > + env.filter = filter; > + env.arch = syscall_arches[arch]; > + env.nr = nr; > + > + while (true) { > + res = seccomp_emu_step(&env, &state); > + if (res) > + break; > + } > + > + if (res < 0) > + goto out; > + > + if (env.syscall_ok) > + set_bit(nr, sfilter->cache.syscall_ok[arch]); I don't really like the complexity here, passing around syscall_ok, etc. I feel like seccomp_emu_step() should be self-contained to say "allow or filter" directly. I also prefer an inversion to the logic: if we start bitmaps as "default allow", we only ever increase the filtering cases: we can never accidentally ADD an allow to the bitmap. (This was an intentional design in the RFC and v1 to do as much as possible to fail safe.) > + } > + } > + > +out: > + return res; > +} > +#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */ > + > /** > * seccomp_prepare_filter: Prepares a seccomp filter for use. 
> * @fprog: BPF program to install > @@ -540,7 +700,8 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog) > { > struct seccomp_filter *sfilter; > int ret; > - const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE); > + const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE) || > + IS_ENABLED(CONFIG_SECCOMP_CACHE_NR_ONLY); > > if (fprog->len == 0 || fprog->len > BPF_MAXINSNS) > return ERR_PTR(-EINVAL); > @@ -571,6 +732,13 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog) > return ERR_PTR(ret); > } > > + ret = seccomp_cache_prepare(sfilter); > + if (ret < 0) { > + bpf_prog_destroy(sfilter->prog); > + kfree(sfilter); > + return ERR_PTR(ret); > + } Why do the prepare here instead of during attach? (And note that it should not be written to fail.) > + > refcount_set(&sfilter->refs, 1); > refcount_set(&sfilter->users, 1); > init_waitqueue_head(&sfilter->wqh); > @@ -606,6 +774,29 @@ seccomp_prepare_user_filter(const char __user *user_filter) > return filter; > } > > +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY > +/** > + * seccomp_cache_inherit - mask accept bitmap against previous filter > + * @sfilter: The seccomp filter > + * @sfilter: The previous seccomp filter > + */ > +static void seccomp_cache_inherit(struct seccomp_filter *sfilter, > + const struct seccomp_filter *prev) > +{ > + int arch; > + > + if (!prev) > + return; > + > + for (arch = 0; arch < ARRAY_SIZE(syscall_arches); arch++) { > + bitmap_and(sfilter->cache.syscall_ok[arch], > + sfilter->cache.syscall_ok[arch], > + prev->cache.syscall_ok[arch], > + NR_syscalls); > + } And, as per being as defensive as I can imagine, this should be a one-way mask: we can only remove bits from syscall_ok, never add them. sfilter must be constructed so that it can only ever have fewer or the same bits set as prev. 
> +}
> +#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */
> +
>  /**
>   * seccomp_attach_filter: validate and attach filter
>   * @flags: flags to change filter behavior
> @@ -655,6 +846,7 @@ static long seccomp_attach_filter(unsigned int flags,
>  	 * task reference.
>  	 */
>  	filter->prev = current->seccomp.filter;
> +	seccomp_cache_inherit(filter, filter->prev);

In the RFC I did this inherit earlier (in the emulation stage) to
benefit from the RET_KILL results, but that's not very useful any more.
However, I think it's still code-locality better to keep the bit
manipulation logic as close together as possible for readability.

>  	current->seccomp.filter = filter;
>  	atomic_inc(&current->seccomp.filter_count);
> 
> -- 
> 2.28.0
> 

-- 
Kees Cook
* Re: [PATCH v2 seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent 2020-09-24 23:25 ` Kees Cook @ 2020-09-25 3:04 ` YiFei Zhu 2020-09-25 16:45 ` YiFei Zhu 0 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-09-25 3:04 UTC (permalink / raw) To: Kees Cook Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry [resending this, forgot to hit reply all...] On Thu, Sep 24, 2020 at 6:25 PM Kees Cook <keescook@chromium.org> wrote: > I'm not interested in seccomp having a config option for this. It should > entire exist or not, and that depends on the per-architecture support. > You mentioned in another thread that you wanted it to let people play > with this support in some way. Can you elaborate on this? My perspective > is that of distro and vendor kernels: there is _one_ config and end > users can't really do anything about it without rolling their own > kernels. That's one. The other is to allow future optional extensions, like syscall-argument-capable accelerators. Distro / vendor kernels will keep defaults anyways, no? > So, as Jann pointed out, using NR_syscalls only accidentally works -- > they're actually different sizes and there isn't strictly any reason to > expect one to be smaller than another. So, we need to either choose the > max() in asm/linux/seccomp.h or be more efficient with space usage and > use explicitly named bitmaps (how my v1 does things). Right. > This isn't used in this patch; likely leftover/in need of moving? Correct. Will remove. > I moved this up in the structure to see if I could benefit from cache > line sharing. In either case, we must verify (with "pahole") that we do > not induce massive padding in the struct. 
> 
> But yes, attaching this to the filter is the right way to go.

Right. I don't think it would cause massive padding with all I know about
padding learnt from [1]. I'm used to using gdb to look at structure layout,
and this is what I see:

(gdb) ptype /o struct seccomp_filter
/* offset    |  size */  type = struct seccomp_filter {
/*    0      |     4 */    refcount_t refs;
/*    4      |     4 */    refcount_t users;
/*    8      |     1 */    bool log;
/* XXX  7-byte hole  */
/*   16      |     8 */    struct seccomp_filter *prev;
[...]
/*  264      |   112 */    struct seccomp_cache_filter_data {
/*  264      |   112 */      unsigned long syscall_ok[2][7];
                             /* total size (bytes):  112 */
                           } cache;
                           /* total size (bytes):  376 */
                         }

The bitmaps are long-aligned; so is the prev pointer. If we want we can put
the cache struct right before prev and that should not introduce any new
holes. It's the refcounts and the bool that are not cooperative.

> nit: "ok" is too vague. We mean either "constant action" or "allow" (or
> "filter" in the negative case).

Right.

> Why is this split out? (i.e. why is it not just a self-contained loop
> the way Jann wrote it?)

Because my brain thinks like a finite state machine and this function is a
state transition. ;) Though yeah, I agree a loop is probably more readable.

> I appreciate the -errno intent, but it actually risks making these
> changes break existing userspace filters: if something is unhandled in
> the emulator in a way we don't find during design and testing, the
> filter load will actually _fail_ instead of just falling back to "run
> filter". Failures should be reported (WARN_ON_ONCE()), but my v1
> intentionally lets this continue.

Right.

> This version appears to have removed all the comments; I liked Jann's
> comments and I had rearranged things a bit to make it more readable
> (IMO) for people that do not immediately understand BPF. :)

Right.
> > +/** > > + * seccomp_cache_prepare - emulate the filter to find cachable syscalls > > + * @sfilter: The seccomp filter > > + * > > + * Returns 0 if successful or -errno if error occurred. > > + */ > > +int seccomp_cache_prepare(struct seccomp_filter *sfilter) > > +{ > > + struct sock_fprog_kern *fprog = sfilter->prog->orig_prog; > > + struct sock_filter *filter = fprog->filter; > > + int arch, nr, res = 0; > > + > > + for (arch = 0; arch < ARRAY_SIZE(syscall_arches); arch++) { > > + for (nr = 0; nr < NR_syscalls; nr++) { > > + struct seccomp_emu_env env = {0}; Btw, do you know what is the initial state of the A register at the start of BPF execution? In my RFC I assumed it's unknown but then in v1 after the "reg_known" removal the register is assumed to be 0. Idk if it is correct to assume so. > I don't really like the complexity here, passing around syscall_ok, etc. > I feel like seccomp_emu_step() should be self-contained to say "allow or > filter" directly. Ok. > I also prefer an inversion to the logic: if we start bitmaps as "default > allow", we only ever increase the filtering cases: we can never > accidentally ADD an allow to the bitmap. (This was an intentional design > in the RFC and v1 to do as much as possible to fail safe.) Wait why? If it's default allow, what if you hit an error? You can accidentally not remove an allow from the bitmap, and that is much more of an issue than accidentally not add an allow. I don't understand your reasoning of "accidentally ADD an allow", an action will only occur when everything is right, but an action might not occur if some random shenanigans happen. Hence, the non-action / default side should be the fail-safe side, rather than the action side. > Why do the prepare here instead of during attach? (And note that it > should not be written to fail.) Right. > And, as per being as defensive as I can imagine, this should be a > one-way mask: we can only remove bits from syscall_ok, never add them. 
> sfilter must be constructed so that it can only ever have fewer or the
> same bits set as prev.

Right.

> In the RFC I did this inherit earlier (in the emulation stage) to
> benefit from the RET_KILL results, but that's not very useful any more.
> However, I think it's still code-locality better to keep the bit
> manipulation logic as close together as possible for readability.

Right.

[1] http://www.catb.org/esr/structure-packing/#_structure_alignment_and_padding

YiFei Zhu
* Re: [PATCH v2 seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent
  2020-09-25  3:04             ` YiFei Zhu
@ 2020-09-25 16:45               ` YiFei Zhu
  2020-09-25 19:42                 ` Kees Cook
  0 siblings, 1 reply; 149+ messages in thread
From: YiFei Zhu @ 2020-09-25 16:45 UTC (permalink / raw)
  To: Kees Cook
  Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai,
	Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn,
	Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

On Thu, Sep 24, 2020 at 10:04 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> > Why do the prepare here instead of during attach? (And note that it
> > should not be written to fail.)
>
> Right.

During attach a spinlock (current->sighand->siglock) is held. Do we
really want to put the emulator in the "atomic section"?

YiFei Zhu
* Re: [PATCH v2 seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent 2020-09-25 16:45 ` YiFei Zhu @ 2020-09-25 19:42 ` Kees Cook 2020-09-25 19:51 ` Andy Lutomirski 0 siblings, 1 reply; 149+ messages in thread From: Kees Cook @ 2020-09-25 19:42 UTC (permalink / raw) To: YiFei Zhu Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Fri, Sep 25, 2020 at 11:45:05AM -0500, YiFei Zhu wrote: > On Thu, Sep 24, 2020 at 10:04 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > > > Why do the prepare here instead of during attach? (And note that it > > > should not be written to fail.) > > > > Right. > > During attach a spinlock (current->sighand->siglock) is held. Do we > really want to put the emulator in the "atomic section"? It's a good point, but I had some other ideas around it that lead to me a different conclusion. Here's what I've got in my head: I don't view filter attach (nor the siglock) as fastpath: the lock is rarely contested and the "long time" will only be during filter attach. When performing filter emulation, all the syscalls that are already marked as "must run filter" on the previous filter can be skipped for the new filter, since it cannot change the outcome, which makes the emulation step faster. The previous filter's bitmap isn't "stable" until siglock is held. If we do the emulation step before siglock, we have to always do full evaluation of all syscalls, and then merge the bitmap during attach. That means all filters ever attached will take maximal time to perform emulation. I prefer the idea of the emulation step taking advantage of the bitmap optimization, since the kernel spends less time doing work over the life of the process tree. 
It's certainly marginal, but it also lets all the bitmap manipulation stay
in one place (as opposed to being split between "prepare" and "attach").

What do you think?

-- 
Kees Cook
* Re: [PATCH v2 seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent 2020-09-25 19:42 ` Kees Cook @ 2020-09-25 19:51 ` Andy Lutomirski 2020-09-25 20:37 ` Kees Cook 0 siblings, 1 reply; 149+ messages in thread From: Andy Lutomirski @ 2020-09-25 19:51 UTC (permalink / raw) To: Kees Cook Cc: YiFei Zhu, Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry > On Sep 25, 2020, at 12:42 PM, Kees Cook <keescook@chromium.org> wrote: > > On Fri, Sep 25, 2020 at 11:45:05AM -0500, YiFei Zhu wrote: >> On Thu, Sep 24, 2020 at 10:04 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: >>>> Why do the prepare here instead of during attach? (And note that it >>>> should not be written to fail.) >>> >>> Right. >> >> During attach a spinlock (current->sighand->siglock) is held. Do we >> really want to put the emulator in the "atomic section"? > > It's a good point, but I had some other ideas around it that lead to me > a different conclusion. Here's what I've got in my head: > > I don't view filter attach (nor the siglock) as fastpath: the lock is > rarely contested and the "long time" will only be during filter attach. > > When performing filter emulation, all the syscalls that are already > marked as "must run filter" on the previous filter can be skipped for > the new filter, since it cannot change the outcome, which makes the > emulation step faster. > > The previous filter's bitmap isn't "stable" until siglock is held. > > If we do the emulation step before siglock, we have to always do full > evaluation of all syscalls, and then merge the bitmap during attach. > That means all filters ever attached will take maximal time to perform > emulation. 
> > I prefer the idea of the emulation step taking advantage of the bitmap > optimization, since the kernel spends less time doing work over the life > of the process tree. It's certainly marginal, but it also lets all the > bitmap manipulation stay in one place (as opposed to being split between > "prepare" and "attach"). > > What do you think? > > I’m wondering if we should be much much lazier. We could potentially wait until someone actually tries to do a given syscall before we try to evaluate whether the result is fixed. ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v2 seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent 2020-09-25 19:51 ` Andy Lutomirski @ 2020-09-25 20:37 ` Kees Cook 2020-09-25 21:07 ` Andy Lutomirski 0 siblings, 1 reply; 149+ messages in thread From: Kees Cook @ 2020-09-25 20:37 UTC (permalink / raw) To: Andy Lutomirski Cc: YiFei Zhu, Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Fri, Sep 25, 2020 at 12:51:20PM -0700, Andy Lutomirski wrote: > > > > On Sep 25, 2020, at 12:42 PM, Kees Cook <keescook@chromium.org> wrote: > > > > On Fri, Sep 25, 2020 at 11:45:05AM -0500, YiFei Zhu wrote: > >> On Thu, Sep 24, 2020 at 10:04 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > >>>> Why do the prepare here instead of during attach? (And note that it > >>>> should not be written to fail.) > >>> > >>> Right. > >> > >> During attach a spinlock (current->sighand->siglock) is held. Do we > >> really want to put the emulator in the "atomic section"? > > > > It's a good point, but I had some other ideas around it that lead to me > > a different conclusion. Here's what I've got in my head: > > > > I don't view filter attach (nor the siglock) as fastpath: the lock is > > rarely contested and the "long time" will only be during filter attach. > > > > When performing filter emulation, all the syscalls that are already > > marked as "must run filter" on the previous filter can be skipped for > > the new filter, since it cannot change the outcome, which makes the > > emulation step faster. > > > > The previous filter's bitmap isn't "stable" until siglock is held. > > > > If we do the emulation step before siglock, we have to always do full > > evaluation of all syscalls, and then merge the bitmap during attach. 
> > That means all filters ever attached will take maximal time to perform
> > emulation.
> >
> > I prefer the idea of the emulation step taking advantage of the bitmap
> > optimization, since the kernel spends less time doing work over the life
> > of the process tree. It's certainly marginal, but it also lets all the
> > bitmap manipulation stay in one place (as opposed to being split between
> > "prepare" and "attach").
> >
> > What do you think?
>
> I’m wondering if we should be much much lazier. We could potentially wait
> until someone actually tries to do a given syscall before we try to
> evaluate whether the result is fixed.

That seems like we'd need to track yet another bitmap of "did we emulate
this yet?" And it means the filter isn't really "done" until you run
another syscall? eeh, I'm not a fan: it scratches at my desire for
determinism. ;) Or maybe my implementation imagination is missing
something?

-- 
Kees Cook
* Re: [PATCH v2 seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent 2020-09-25 20:37 ` Kees Cook @ 2020-09-25 21:07 ` Andy Lutomirski 2020-09-25 23:49 ` Kees Cook 2020-09-26 1:23 ` YiFei Zhu 0 siblings, 2 replies; 149+ messages in thread From: Andy Lutomirski @ 2020-09-25 21:07 UTC (permalink / raw) To: Kees Cook Cc: YiFei Zhu, Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Fri, Sep 25, 2020 at 1:37 PM Kees Cook <keescook@chromium.org> wrote: > > On Fri, Sep 25, 2020 at 12:51:20PM -0700, Andy Lutomirski wrote: > > > > > > > On Sep 25, 2020, at 12:42 PM, Kees Cook <keescook@chromium.org> wrote: > > > > > > On Fri, Sep 25, 2020 at 11:45:05AM -0500, YiFei Zhu wrote: > > >> On Thu, Sep 24, 2020 at 10:04 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > > >>>> Why do the prepare here instead of during attach? (And note that it > > >>>> should not be written to fail.) > > >>> > > >>> Right. > > >> > > >> During attach a spinlock (current->sighand->siglock) is held. Do we > > >> really want to put the emulator in the "atomic section"? > > > > > > It's a good point, but I had some other ideas around it that lead to me > > > a different conclusion. Here's what I've got in my head: > > > > > > I don't view filter attach (nor the siglock) as fastpath: the lock is > > > rarely contested and the "long time" will only be during filter attach. > > > > > > When performing filter emulation, all the syscalls that are already > > > marked as "must run filter" on the previous filter can be skipped for > > > the new filter, since it cannot change the outcome, which makes the > > > emulation step faster. > > > > > > The previous filter's bitmap isn't "stable" until siglock is held. 
> > > If we do the emulation step before siglock, we have to always do full
> > > evaluation of all syscalls, and then merge the bitmap during attach.
> > > That means all filters ever attached will take maximal time to perform
> > > emulation.
> > >
> > > I prefer the idea of the emulation step taking advantage of the bitmap
> > > optimization, since the kernel spends less time doing work over the life
> > > of the process tree. It's certainly marginal, but it also lets all the
> > > bitmap manipulation stay in one place (as opposed to being split between
> > > "prepare" and "attach").
> > >
> > > What do you think?
> >
> > I’m wondering if we should be much much lazier. We could potentially wait
> > until someone actually tries to do a given syscall before we try to
> > evaluate whether the result is fixed.
>
> That seems like we'd need to track yet another bitmap of "did we emulate
> this yet?" And it means the filter isn't really "done" until you run
> another syscall? eeh, I'm not a fan: it scratches at my desire for
> determinism. ;) Or maybe my implementation imagination is missing
> something?

We'd need at least three states per syscall: unknown, always-allow, and
need-to-run-filter.

The downsides are less determinism and a bit of an uglier implementation.
The upside is that we don't need to loop over all syscalls at load --
instead the time that each operation takes is independent of the total
number of syscalls on the system. And we can entirely avoid, say,
evaluating the x32 case until the task tries an x32 syscall.

I think it's at least worth considering.

--Andy
* Re: [PATCH v2 seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent 2020-09-25 21:07 ` Andy Lutomirski @ 2020-09-25 23:49 ` Kees Cook 2020-09-26 0:34 ` Andy Lutomirski 2020-09-26 1:23 ` YiFei Zhu 1 sibling, 1 reply; 149+ messages in thread From: Kees Cook @ 2020-09-25 23:49 UTC (permalink / raw) To: Andy Lutomirski Cc: YiFei Zhu, Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Fri, Sep 25, 2020 at 02:07:46PM -0700, Andy Lutomirski wrote: > On Fri, Sep 25, 2020 at 1:37 PM Kees Cook <keescook@chromium.org> wrote: > > > > On Fri, Sep 25, 2020 at 12:51:20PM -0700, Andy Lutomirski wrote: > > > > > > > > > > On Sep 25, 2020, at 12:42 PM, Kees Cook <keescook@chromium.org> wrote: > > > > > > > > On Fri, Sep 25, 2020 at 11:45:05AM -0500, YiFei Zhu wrote: > > > >> On Thu, Sep 24, 2020 at 10:04 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > > > >>>> Why do the prepare here instead of during attach? (And note that it > > > >>>> should not be written to fail.) > > > >>> > > > >>> Right. > > > >> > > > >> During attach a spinlock (current->sighand->siglock) is held. Do we > > > >> really want to put the emulator in the "atomic section"? > > > > > > > > It's a good point, but I had some other ideas around it that lead to me > > > > a different conclusion. Here's what I've got in my head: > > > > > > > > I don't view filter attach (nor the siglock) as fastpath: the lock is > > > > rarely contested and the "long time" will only be during filter attach. > > > > > > > > When performing filter emulation, all the syscalls that are already > > > > marked as "must run filter" on the previous filter can be skipped for > > > > the new filter, since it cannot change the outcome, which makes the > > > > emulation step faster. 
> > > > > > > > The previous filter's bitmap isn't "stable" until siglock is held. > > > > > > > > If we do the emulation step before siglock, we have to always do full > > > > evaluation of all syscalls, and then merge the bitmap during attach. > > > > That means all filters ever attached will take maximal time to perform > > > > emulation. > > > > > > > > I prefer the idea of the emulation step taking advantage of the bitmap > > > > optimization, since the kernel spends less time doing work over the life > > > > of the process tree. It's certainly marginal, but it also lets all the > > > > bitmap manipulation stay in one place (as opposed to being split between > > > > "prepare" and "attach"). > > > > > > > > What do you think? > > > > > > > > > > > > > > I’m wondering if we should be much much lazier. We could potentially wait until someone actually tries to do a given syscall before we try to evaluate whether the result is fixed. > > > > That seems like we'd need to track yet another bitmap of "did we emulate > > this yet?" And it means the filter isn't really "done" until you run > > another syscall? eeh, I'm not a fan: it scratches at my desire for > > determinism. ;) Or maybe my implementation imagination is missing > > something? > > > > We'd need at least three states per syscall: unknown, always-allow, > and need-to-run-filter. > > The downsides are less determinism and a bit of an uglier > implementation. The upside is that we don't need to loop over all > syscalls at load -- instead the time that each operation takes is > independent of the total number of syscalls on the system. And we can > entirely avoid, say, evaluating the x32 case until the task tries an > x32 syscall. > > I think it's at least worth considering. Yeah, worth considering. I do still think the time spent in emulation is SO small that it doesn't matter running all of the syscalls at attach time. The filters are tiny and fail quickly if anything "interesting" starts to happen. 
;) -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v2 seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent 2020-09-25 23:49 ` Kees Cook @ 2020-09-26 0:34 ` Andy Lutomirski 0 siblings, 0 replies; 149+ messages in thread From: Andy Lutomirski @ 2020-09-26 0:34 UTC (permalink / raw) To: Kees Cook Cc: YiFei Zhu, Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry > On Sep 25, 2020, at 4:49 PM, Kees Cook <keescook@chromium.org> wrote: > > On Fri, Sep 25, 2020 at 02:07:46PM -0700, Andy Lutomirski wrote: >>> On Fri, Sep 25, 2020 at 1:37 PM Kees Cook <keescook@chromium.org> wrote: >>> >>> On Fri, Sep 25, 2020 at 12:51:20PM -0700, Andy Lutomirski wrote: >>>> >>>> >>>>> On Sep 25, 2020, at 12:42 PM, Kees Cook <keescook@chromium.org> wrote: >>>>> >>>>> On Fri, Sep 25, 2020 at 11:45:05AM -0500, YiFei Zhu wrote: >>>>>> On Thu, Sep 24, 2020 at 10:04 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: >>>>>>>> Why do the prepare here instead of during attach? (And note that it >>>>>>>> should not be written to fail.) >>>>>>> >>>>>>> Right. >>>>>> >>>>>> During attach a spinlock (current->sighand->siglock) is held. Do we >>>>>> really want to put the emulator in the "atomic section"? >>>>> >>>>> It's a good point, but I had some other ideas around it that lead to me >>>>> a different conclusion. Here's what I've got in my head: >>>>> >>>>> I don't view filter attach (nor the siglock) as fastpath: the lock is >>>>> rarely contested and the "long time" will only be during filter attach. >>>>> >>>>> When performing filter emulation, all the syscalls that are already >>>>> marked as "must run filter" on the previous filter can be skipped for >>>>> the new filter, since it cannot change the outcome, which makes the >>>>> emulation step faster. 
>>>>> >>>>> The previous filter's bitmap isn't "stable" until siglock is held. >>>>> >>>>> If we do the emulation step before siglock, we have to always do full >>>>> evaluation of all syscalls, and then merge the bitmap during attach. >>>>> That means all filters ever attached will take maximal time to perform >>>>> emulation. >>>>> >>>>> I prefer the idea of the emulation step taking advantage of the bitmap >>>>> optimization, since the kernel spends less time doing work over the life >>>>> of the process tree. It's certainly marginal, but it also lets all the >>>>> bitmap manipulation stay in one place (as opposed to being split between >>>>> "prepare" and "attach"). >>>>> >>>>> What do you think? >>>>> >>>>> >>>> >>>> I’m wondering if we should be much much lazier. We could potentially wait until someone actually tries to do a given syscall before we try to evaluate whether the result is fixed. >>> >>> That seems like we'd need to track yet another bitmap of "did we emulate >>> this yet?" And it means the filter isn't really "done" until you run >>> another syscall? eeh, I'm not a fan: it scratches at my desire for >>> determinism. ;) Or maybe my implementation imagination is missing >>> something? >>> >> >> We'd need at least three states per syscall: unknown, always-allow, >> and need-to-run-filter. >> >> The downsides are less determinism and a bit of an uglier >> implementation. The upside is that we don't need to loop over all >> syscalls at load -- instead the time that each operation takes is >> independent of the total number of syscalls on the system. And we can >> entirely avoid, say, evaluating the x32 case until the task tries an >> x32 syscall. >> >> I think it's at least worth considering. > > Yeah, worth considering. I do still think the time spent in emulation is > SO small that it doesn't matter running all of the syscalls at attach > time. The filters are tiny and fail quickly if anything "interesting" > start to happen. 
;) > There’s a middle ground, too: do it lazily per arch. So we would allocate and populate the compat bitmap the first time a compat syscall is attempted and do the same for x32. This may help avoid the annoying extra memory usage and 3x startup overhead while retaining full functionality. ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v2 seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent 2020-09-25 21:07 ` Andy Lutomirski 2020-09-25 23:49 ` Kees Cook @ 2020-09-26 1:23 ` YiFei Zhu 2020-09-26 2:47 ` Andy Lutomirski 1 sibling, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-09-26 1:23 UTC (permalink / raw) To: Andy Lutomirski Cc: Kees Cook, Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Fri, Sep 25, 2020 at 4:07 PM Andy Lutomirski <luto@amacapital.net> wrote: > We'd need at least three states per syscall: unknown, always-allow, > and need-to-run-filter. > > The downsides are less determinism and a bit of an uglier > implementation. The upside is that we don't need to loop over all > syscalls at load -- instead the time that each operation takes is > independent of the total number of syscalls on the system. And we can > entirely avoid, say, evaluating the x32 case until the task tries an > x32 syscall. I was really afraid of multiple tasks writing to the bitmaps at once, hence I used bitmap-per-task. Now that I think about it, if this stays lockless, the worst thing that can happen is that a write undoes a bit set by another task. In this case, if the "known" bit is cleared then the worst would be that the emulation is run many times. But if the "always allow" bit is cleared but not the "known" bit then we have an issue: the syscall will always be executed in BPF. Is it worth holding a spinlock here? Though I'll try to get the benchmark numbers for the emulator later tonight. YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v2 seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent 2020-09-26 1:23 ` YiFei Zhu @ 2020-09-26 2:47 ` Andy Lutomirski 2020-09-26 4:35 ` Kees Cook 0 siblings, 1 reply; 149+ messages in thread From: Andy Lutomirski @ 2020-09-26 2:47 UTC (permalink / raw) To: YiFei Zhu Cc: Kees Cook, Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry > On Sep 25, 2020, at 6:23 PM, YiFei Zhu <zhuyifei1999@gmail.com> wrote: > > On Fri, Sep 25, 2020 at 4:07 PM Andy Lutomirski <luto@amacapital.net> wrote: >> We'd need at least three states per syscall: unknown, always-allow, >> and need-to-run-filter. >> >> The downsides are less determinism and a bit of an uglier >> implementation. The upside is that we don't need to loop over all >> syscalls at load -- instead the time that each operation takes is >> independent of the total number of syscalls on the system. And we can >> entirely avoid, say, evaluating the x32 case until the task tries an >> x32 syscall. > > I was really afraid of multiple tasks writing to the bitmaps at once, > hence I used bitmap-per-task. Now I think about it, if this stays > lockless, the worst thing that can happen is that a write undo a bit > set by another task. In this case, if the "known" bit is cleared then > the worst would be the emulation is run many times. But if the "always > allow" is cleared but not "known" bit then we have an issue: the > syscall will always be executed in BPF. > If you interleave the bits, then you can read and write them atomically — both bits for any given syscall will be in the same word. > Is it worth holding a spinlock here? > > Though I'll try to get the benchmark numbers for the emulator later tonight. > > YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v2 seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent 2020-09-26 2:47 ` Andy Lutomirski @ 2020-09-26 4:35 ` Kees Cook 0 siblings, 0 replies; 149+ messages in thread From: Kees Cook @ 2020-09-26 4:35 UTC (permalink / raw) To: Andy Lutomirski Cc: YiFei Zhu, Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Fri, Sep 25, 2020 at 07:47:47PM -0700, Andy Lutomirski wrote: > > > On Sep 25, 2020, at 6:23 PM, YiFei Zhu <zhuyifei1999@gmail.com> wrote: > > > > On Fri, Sep 25, 2020 at 4:07 PM Andy Lutomirski <luto@amacapital.net> wrote: > >> We'd need at least three states per syscall: unknown, always-allow, > >> and need-to-run-filter. > >> > >> The downsides are less determinism and a bit of an uglier > >> implementation. The upside is that we don't need to loop over all > >> syscalls at load -- instead the time that each operation takes is > >> independent of the total number of syscalls on the system. And we can > >> entirely avoid, say, evaluating the x32 case until the task tries an > >> x32 syscall. > > > > I was really afraid of multiple tasks writing to the bitmaps at once, > > hence I used bitmap-per-task. Now I think about it, if this stays > > lockless, the worst thing that can happen is that a write undo a bit > > set by another task. In this case, if the "known" bit is cleared then > > the worst would be the emulation is run many times. But if the "always > > allow" is cleared but not "known" bit then we have an issue: the > > syscall will always be executed in BPF. > > > > If you interleave the bits, then you can read and write them atomically — both bits for any given syscall will be in the same word. I think we can just hold the spinlock. :) -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
* [PATCH v2 seccomp 4/6] seccomp/cache: Lookup syscall allowlist for fast path 2020-09-24 12:44 ` [PATCH v2 seccomp 0/6] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls YiFei Zhu ` (2 preceding siblings ...) 2020-09-24 12:44 ` [PATCH v2 seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent YiFei Zhu @ 2020-09-24 12:44 ` YiFei Zhu 2020-09-24 23:46 ` Kees Cook 2020-09-24 12:44 ` [PATCH v2 seccomp 5/6] selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu 2020-09-24 12:44 ` [PATCH v2 seccomp 6/6] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu 5 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-09-24 12:44 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: YiFei Zhu <yifeifz2@illinois.edu> The fast (common) path for seccomp should be that the filter permits the syscall to pass through, and failing seccomp is expected to be an exceptional case; it is not expected for userspace to call a denylisted syscall over and over. This first finds the current allow bitmask by iterating through syscall_arches[] array and comparing it to the one in struct seccomp_data; this loop is expected to be unrolled. It then does a test_bit against the bitmask. If the bit is set, then there is no need to run the full filter; it returns SECCOMP_RET_ALLOW immediately. 
Co-developed-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu> Signed-off-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu> Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> --- kernel/seccomp.c | 37 +++++++++++++++++++++++++++++++++++++ 1 file changed, 37 insertions(+) diff --git a/kernel/seccomp.c b/kernel/seccomp.c index 20d33378a092..ac0266b6d18a 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -167,6 +167,12 @@ static inline void seccomp_cache_inherit(struct seccomp_filter *sfilter, const struct seccomp_filter *prev) { } + +static inline bool seccomp_cache_check(const struct seccomp_filter *sfilter, + const struct seccomp_data *sd) +{ + return false; +} #endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */ /** @@ -321,6 +327,34 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen) return 0; } +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY +/** + * seccomp_cache_check - lookup seccomp cache + * @sfilter: The seccomp filter + * @sd: The seccomp data to lookup the cache with + * + * Returns true if the seccomp_data is cached and allowed. 
+ */ +static bool seccomp_cache_check(const struct seccomp_filter *sfilter, + const struct seccomp_data *sd) +{ + int syscall_nr = sd->nr; + int arch; + + if (unlikely(syscall_nr < 0 || syscall_nr >= NR_syscalls)) + return false; + + for (arch = 0; arch < ARRAY_SIZE(syscall_arches); arch++) { + if (likely(syscall_arches[arch] == sd->arch)) + return test_bit(syscall_nr, + sfilter->cache.syscall_ok[arch]); + } + + WARN_ON_ONCE(true); + return false; +} +#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */ + /** * seccomp_run_filters - evaluates all seccomp filters against @sd * @sd: optional seccomp data to be passed to filters @@ -343,6 +377,9 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd, if (WARN_ON(f == NULL)) return SECCOMP_RET_KILL_PROCESS; + if (seccomp_cache_check(f, sd)) + return SECCOMP_RET_ALLOW; + /* * All filters in the list are evaluated and the lowest BPF return * value always takes priority (ignoring the DATA). -- 2.28.0 ^ permalink raw reply related [flat|nested] 149+ messages in thread
* Re: [PATCH v2 seccomp 4/6] seccomp/cache: Lookup syscall allowlist for fast path 2020-09-24 12:44 ` [PATCH v2 seccomp 4/6] seccomp/cache: Lookup syscall allowlist for fast path YiFei Zhu @ 2020-09-24 23:46 ` Kees Cook 2020-09-25 1:55 ` YiFei Zhu 0 siblings, 1 reply; 149+ messages in thread From: Kees Cook @ 2020-09-24 23:46 UTC (permalink / raw) To: YiFei Zhu Cc: containers, YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Thu, Sep 24, 2020 at 07:44:19AM -0500, YiFei Zhu wrote: > From: YiFei Zhu <yifeifz2@illinois.edu> > > The fast (common) path for seccomp should be that the filter permits > the syscall to pass through, and failing seccomp is expected to be > an exceptional case; it is not expected for userspace to call a > denylisted syscall over and over. > > This first finds the current allow bitmask by iterating through > syscall_arches[] array and comparing it to the one in struct > seccomp_data; this loop is expected to be unrolled. It then > does a test_bit against the bitmask. If the bit is set, then > there is no need to run the full filter; it returns > SECCOMP_RET_ALLOW immediately. 
> > Co-developed-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu> > Signed-off-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu> > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> > --- > kernel/seccomp.c | 37 +++++++++++++++++++++++++++++++++++++ > 1 file changed, 37 insertions(+) > > diff --git a/kernel/seccomp.c b/kernel/seccomp.c > index 20d33378a092..ac0266b6d18a 100644 > --- a/kernel/seccomp.c > +++ b/kernel/seccomp.c > @@ -167,6 +167,12 @@ static inline void seccomp_cache_inherit(struct seccomp_filter *sfilter, > const struct seccomp_filter *prev) > { > } > + > +static inline bool seccomp_cache_check(const struct seccomp_filter *sfilter, > + const struct seccomp_data *sd) > +{ > + return false; > +} > #endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */ > > /** > @@ -321,6 +327,34 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen) > return 0; > } > > +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY > +/** > + * seccomp_cache_check - lookup seccomp cache > + * @sfilter: The seccomp filter > + * @sd: The seccomp data to lookup the cache with > + * > + * Returns true if the seccomp_data is cached and allowed. > + */ > +static bool seccomp_cache_check(const struct seccomp_filter *sfilter, > + const struct seccomp_data *sd) > +{ > + int syscall_nr = sd->nr; > + int arch; > + > + if (unlikely(syscall_nr < 0 || syscall_nr >= NR_syscalls)) > + return false; This protects us from x32 (i.e. syscall_nr will have 0x40000000 bit set), but given the effort needed to support compat, I think supporting x32 isn't much more. (Though again, I note that NR_syscalls differs in size, so this test needs to be per-arch and obviously after arch-discovery.) That said, if it really does turn out that x32 is literally the only architecture doing these shenanigans (and I suspect not, given the MIPS case), okay, fine, I'll give in. :) You and Jann both seem to think this isn't worth it. 
> + > + for (arch = 0; arch < ARRAY_SIZE(syscall_arches); arch++) { > + if (likely(syscall_arches[arch] == sd->arch)) I think this linear search for the matching arch can be made O(1) (this is what I was trying to do in v1: we can map all possible combos to a distinct bitmap, so there is just math and lookup rather than a linear compare search. In the one-arch case, it can also be easily collapsed into a no-op (though my v1 didn't do this correctly). > + return test_bit(syscall_nr, > + sfilter->cache.syscall_ok[arch]); > + } > + > + WARN_ON_ONCE(true); > + return false; > +} > +#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */ > + > /** > * seccomp_run_filters - evaluates all seccomp filters against @sd > * @sd: optional seccomp data to be passed to filters > @@ -343,6 +377,9 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd, > if (WARN_ON(f == NULL)) > return SECCOMP_RET_KILL_PROCESS; > > + if (seccomp_cache_check(f, sd)) > + return SECCOMP_RET_ALLOW; > + > /* > * All filters in the list are evaluated and the lowest BPF return > * value always takes priority (ignoring the DATA). > -- > 2.28.0 > -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v2 seccomp 4/6] seccomp/cache: Lookup syscall allowlist for fast path 2020-09-24 23:46 ` Kees Cook @ 2020-09-25 1:55 ` YiFei Zhu 0 siblings, 0 replies; 149+ messages in thread From: YiFei Zhu @ 2020-09-25 1:55 UTC (permalink / raw) To: Kees Cook Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Thu, Sep 24, 2020 at 6:46 PM Kees Cook <keescook@chromium.org> wrote: > This protects us from x32 (i.e. syscall_nr will have 0x40000000 bit > set), but given the effort needed to support compat, I think supporting > x32 isn't much more. (Though again, I note that NR_syscalls differs in > size, so this test needs to be per-arch and obviously after > arch-discovery.) > > That said, if it really does turn out that x32 is literally the only > architecture doing these shenanigans (and I suspect not, given the MIPS > case), okay, fine, I'll give in. :) You and Jann both seem to think this > isn't worth it. MIPS has the sparse syscall shenanigans... idek how that works. Maybe someone can clarify? > I think this linear search for the matching arch can be made O(1) (this > is what I was trying to do in v1: we can map all possible combos to a > distinct bitmap, so there is just math and lookup rather than a linear > compare search. In the one-arch case, it can also be easily collapsed > into a no-op (though my v1 didn't do this correctly). I remember yours was: static inline u8 seccomp_get_arch(u32 syscall_arch, u32 syscall_nr) { [...] switch (syscall_arch) { case SECCOMP_ARCH: seccomp_arch = SECCOMP_ARCH_IS_NATIVE; break; #ifdef CONFIG_COMPAT case SECCOMP_ARCH_COMPAT: seccomp_arch = SECCOMP_ARCH_IS_COMPAT; break; #endif default: seccomp_arch = SECCOMP_ARCH_IS_UNKNOWN; } What I'm relying on here is that the compiler will unroll the loop. 
How does the compiler perform switch statements? I was imagining it would be similar, with "case" corresponding to a compare on the immediate, and the assign as a move to a register, and break corresponding to a jump. This would also be O(n) in the number of arches. Yes, compilers can also do an O(1) table lookup, but that is nonsensical here -- the arch numbers occupy the MSBs. That said, does O(1) or O(n) matter here? Given that n is at most 3 you might as well consider it a constant. Also, is "collapse in the one-arch case" actually worth it? Given that there's a likely(), and the other side is a WARN_ON_ONCE(), the compiler will lay out the likely path in the fast path and branch prediction will be in our favor, right? YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
* [PATCH v2 seccomp 5/6] selftests/seccomp: Compare bitmap vs filter overhead 2020-09-24 12:44 ` [PATCH v2 seccomp 0/6] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls YiFei Zhu ` (3 preceding siblings ...) 2020-09-24 12:44 ` [PATCH v2 seccomp 4/6] seccomp/cache: Lookup syscall allowlist for fast path YiFei Zhu @ 2020-09-24 12:44 ` YiFei Zhu 2020-09-24 23:47 ` Kees Cook 2020-09-24 12:44 ` [PATCH v2 seccomp 6/6] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu 5 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-09-24 12:44 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: Kees Cook <keescook@chromium.org> As part of the seccomp benchmarking, include the expectations with regard to the timing behavior of the constant action bitmaps, and report inconsistencies better. Example output with constant action bitmaps on x86: $ sudo ./seccomp_benchmark 100000000 Current BPF sysctl settings: net.core.bpf_jit_enable = 1 net.core.bpf_jit_harden = 0 Benchmarking 100000000 syscalls... 
63.896255358 - 0.008504529 = 63887750829 (63.9s) getpid native: 638 ns 130.383312423 - 63.897315189 = 66485997234 (66.5s) getpid RET_ALLOW 1 filter (bitmap): 664 ns 196.789080421 - 130.384414983 = 66404665438 (66.4s) getpid RET_ALLOW 2 filters (bitmap): 664 ns 268.844643304 - 196.790234168 = 72054409136 (72.1s) getpid RET_ALLOW 3 filters (full): 720 ns 342.627472515 - 268.845799103 = 73781673412 (73.8s) getpid RET_ALLOW 4 filters (full): 737 ns Estimated total seccomp overhead for 1 bitmapped filter: 26 ns Estimated total seccomp overhead for 2 bitmapped filters: 26 ns Estimated total seccomp overhead for 3 full filters: 82 ns Estimated total seccomp overhead for 4 full filters: 99 ns Estimated seccomp entry overhead: 26 ns Estimated seccomp per-filter overhead (last 2 diff): 17 ns Estimated seccomp per-filter overhead (filters / 4): 18 ns Expectations: native ≤ 1 bitmap (638 ≤ 664): ✔️ native ≤ 1 filter (638 ≤ 720): ✔️ per-filter (last 2 diff) ≈ per-filter (filters / 4) (17 ≈ 18): ✔️ 1 bitmapped ≈ 2 bitmapped (26 ≈ 26): ✔️ entry ≈ 1 bitmapped (26 ≈ 26): ✔️ entry ≈ 2 bitmapped (26 ≈ 26): ✔️ native + entry + (per filter * 4) ≈ 4 filters total (732 ≈ 737): ✔️ Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> --- .../selftests/seccomp/seccomp_benchmark.c | 151 +++++++++++++++--- tools/testing/selftests/seccomp/settings | 2 +- 2 files changed, 130 insertions(+), 23 deletions(-) diff --git a/tools/testing/selftests/seccomp/seccomp_benchmark.c b/tools/testing/selftests/seccomp/seccomp_benchmark.c index 91f5a89cadac..fcc806585266 100644 --- a/tools/testing/selftests/seccomp/seccomp_benchmark.c +++ b/tools/testing/selftests/seccomp/seccomp_benchmark.c @@ -4,12 +4,16 @@ */ #define _GNU_SOURCE #include <assert.h> +#include <limits.h> +#include <stdbool.h> +#include <stddef.h> #include <stdio.h> #include <stdlib.h> #include <time.h> #include <unistd.h> #include <linux/filter.h> #include <linux/seccomp.h> +#include <sys/param.h> 
#include <sys/prctl.h> #include <sys/syscall.h> #include <sys/types.h> @@ -70,18 +74,74 @@ unsigned long long calibrate(void) return samples * seconds; } +bool approx(int i_one, int i_two) +{ + double one = i_one, one_bump = one * 0.01; + double two = i_two, two_bump = two * 0.01; + + one_bump = one + MAX(one_bump, 2.0); + two_bump = two + MAX(two_bump, 2.0); + + /* Equal to, or within 1% or 2 digits */ + if (one == two || + (one > two && one <= two_bump) || + (two > one && two <= one_bump)) + return true; + return false; +} + +bool le(int i_one, int i_two) +{ + if (i_one <= i_two) + return true; + return false; +} + +long compare(const char *name_one, const char *name_eval, const char *name_two, + unsigned long long one, bool (*eval)(int, int), unsigned long long two) +{ + bool good; + + printf("\t%s %s %s (%lld %s %lld): ", name_one, name_eval, name_two, + (long long)one, name_eval, (long long)two); + if (one > INT_MAX) { + printf("Miscalculation! Measurement went negative: %lld\n", (long long)one); + return 1; + } + if (two > INT_MAX) { + printf("Miscalculation! Measurement went negative: %lld\n", (long long)two); + return 1; + } + + good = eval(one, two); + printf("%s\n", good ? "✔️" : "❌"); + + return good ? 
0 : 1; +} + int main(int argc, char *argv[]) { + struct sock_filter bitmap_filter[] = { + BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, nr)), + BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW), + }; + struct sock_fprog bitmap_prog = { + .len = (unsigned short)ARRAY_SIZE(bitmap_filter), + .filter = bitmap_filter, + }; struct sock_filter filter[] = { + BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, args[0])), BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW), }; struct sock_fprog prog = { .len = (unsigned short)ARRAY_SIZE(filter), .filter = filter, }; - long ret; - unsigned long long samples; - unsigned long long native, filter1, filter2; + + long ret, bits; + unsigned long long samples, calc; + unsigned long long native, filter1, filter2, bitmap1, bitmap2; + unsigned long long entry, per_filter1, per_filter2; printf("Current BPF sysctl settings:\n"); system("sysctl net.core.bpf_jit_enable"); @@ -101,35 +161,82 @@ int main(int argc, char *argv[]) ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0); assert(ret == 0); - /* One filter */ - ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog); + /* One filter resulting in a bitmap */ + ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bitmap_prog); assert(ret == 0); - filter1 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples; - printf("getpid RET_ALLOW 1 filter: %llu ns\n", filter1); + bitmap1 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples; + printf("getpid RET_ALLOW 1 filter (bitmap): %llu ns\n", bitmap1); + + /* Second filter resulting in a bitmap */ + ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bitmap_prog); + assert(ret == 0); - if (filter1 == native) - printf("No overhead measured!? 
Try running again with more samples.\n"); + bitmap2 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples; + printf("getpid RET_ALLOW 2 filters (bitmap): %llu ns\n", bitmap2); - /* Two filters */ + /* Third filter, can no longer be converted to bitmap */ ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog); assert(ret == 0); - filter2 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples; - printf("getpid RET_ALLOW 2 filters: %llu ns\n", filter2); - - /* Calculations */ - printf("Estimated total seccomp overhead for 1 filter: %llu ns\n", - filter1 - native); + filter1 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples; + printf("getpid RET_ALLOW 3 filters (full): %llu ns\n", filter1); - printf("Estimated total seccomp overhead for 2 filters: %llu ns\n", - filter2 - native); + /* Fourth filter, can not be converted to bitmap because of filter 3 */ + ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bitmap_prog); + assert(ret == 0); - printf("Estimated seccomp per-filter overhead: %llu ns\n", - filter2 - filter1); + filter2 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples; + printf("getpid RET_ALLOW 4 filters (full): %llu ns\n", filter2); + + /* Estimations */ +#define ESTIMATE(fmt, var, what) do { \ + var = (what); \ + printf("Estimated " fmt ": %llu ns\n", var); \ + if (var > INT_MAX) \ + goto more_samples; \ + } while (0) + + ESTIMATE("total seccomp overhead for 1 bitmapped filter", calc, + bitmap1 - native); + ESTIMATE("total seccomp overhead for 2 bitmapped filters", calc, + bitmap2 - native); + ESTIMATE("total seccomp overhead for 3 full filters", calc, + filter1 - native); + ESTIMATE("total seccomp overhead for 4 full filters", calc, + filter2 - native); + ESTIMATE("seccomp entry overhead", entry, + bitmap1 - native - (bitmap2 - bitmap1)); + ESTIMATE("seccomp per-filter overhead (last 2 diff)", per_filter1, + filter2 - filter1); + ESTIMATE("seccomp per-filter overhead (filters / 4)", per_filter2, + (filter2 - native - entry) / 4); + + 
printf("Expectations:\n"); + ret |= compare("native", "≤", "1 bitmap", native, le, bitmap1); + bits = compare("native", "≤", "1 filter", native, le, filter1); + if (bits) + goto more_samples; + + ret |= compare("per-filter (last 2 diff)", "≈", "per-filter (filters / 4)", + per_filter1, approx, per_filter2); + + bits = compare("1 bitmapped", "≈", "2 bitmapped", + bitmap1 - native, approx, bitmap2 - native); + if (bits) { + printf("Skipping constant action bitmap expectations: they appear unsupported.\n"); + goto out; + } - printf("Estimated seccomp entry overhead: %llu ns\n", - filter1 - native - (filter2 - filter1)); + ret |= compare("entry", "≈", "1 bitmapped", entry, approx, bitmap1 - native); + ret |= compare("entry", "≈", "2 bitmapped", entry, approx, bitmap2 - native); + ret |= compare("native + entry + (per filter * 4)", "≈", "4 filters total", + entry + (per_filter1 * 4) + native, approx, filter2); + if (ret == 0) + goto out; +more_samples: + printf("Saw unexpected benchmark result. Try running again with more samples?\n"); +out: return 0; } diff --git a/tools/testing/selftests/seccomp/settings b/tools/testing/selftests/seccomp/settings index ba4d85f74cd6..6091b45d226b 100644 --- a/tools/testing/selftests/seccomp/settings +++ b/tools/testing/selftests/seccomp/settings @@ -1 +1 @@ -timeout=90 +timeout=120 -- 2.28.0 ^ permalink raw reply related [flat|nested] 149+ messages in thread
* Re: [PATCH v2 seccomp 5/6] selftests/seccomp: Compare bitmap vs filter overhead 2020-09-24 12:44 ` [PATCH v2 seccomp 5/6] selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu @ 2020-09-24 23:47 ` Kees Cook 2020-09-25 1:35 ` YiFei Zhu 0 siblings, 1 reply; 149+ messages in thread From: Kees Cook @ 2020-09-24 23:47 UTC (permalink / raw) To: YiFei Zhu Cc: containers, YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Thu, Sep 24, 2020 at 07:44:20AM -0500, YiFei Zhu wrote: > From: Kees Cook <keescook@chromium.org> > > As part of the seccomp benchmarking, include the expectations with > regard to the timing behavior of the constant action bitmaps, and report > inconsistencies better. > > Example output with constant action bitmaps on x86: > > $ sudo ./seccomp_benchmark 100000000 > Current BPF sysctl settings: > net.core.bpf_jit_enable = 1 > net.core.bpf_jit_harden = 0 > Benchmarking 100000000 syscalls... 
> 63.896255358 - 0.008504529 = 63887750829 (63.9s) > getpid native: 638 ns > 130.383312423 - 63.897315189 = 66485997234 (66.5s) > getpid RET_ALLOW 1 filter (bitmap): 664 ns > 196.789080421 - 130.384414983 = 66404665438 (66.4s) > getpid RET_ALLOW 2 filters (bitmap): 664 ns > 268.844643304 - 196.790234168 = 72054409136 (72.1s) > getpid RET_ALLOW 3 filters (full): 720 ns > 342.627472515 - 268.845799103 = 73781673412 (73.8s) > getpid RET_ALLOW 4 filters (full): 737 ns > Estimated total seccomp overhead for 1 bitmapped filter: 26 ns > Estimated total seccomp overhead for 2 bitmapped filters: 26 ns > Estimated total seccomp overhead for 3 full filters: 82 ns > Estimated total seccomp overhead for 4 full filters: 99 ns > Estimated seccomp entry overhead: 26 ns > Estimated seccomp per-filter overhead (last 2 diff): 17 ns > Estimated seccomp per-filter overhead (filters / 4): 18 ns > Expectations: > native ≤ 1 bitmap (638 ≤ 664): ✔️ > native ≤ 1 filter (638 ≤ 720): ✔️ > per-filter (last 2 diff) ≈ per-filter (filters / 4) (17 ≈ 18): ✔️ > 1 bitmapped ≈ 2 bitmapped (26 ≈ 26): ✔️ > entry ≈ 1 bitmapped (26 ≈ 26): ✔️ > entry ≈ 2 bitmapped (26 ≈ 26): ✔️ > native + entry + (per filter * 4) ≈ 4 filters total (732 ≈ 737): ✔️ > > Signed-off-by: Kees Cook <keescook@chromium.org> > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> BTW, did this benchmark tool's results match your expectations from what you saw with your RFC? (I assume it helped since you've included in here.) -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v2 seccomp 5/6] selftests/seccomp: Compare bitmap vs filter overhead 2020-09-24 23:47 ` Kees Cook @ 2020-09-25 1:35 ` YiFei Zhu 0 siblings, 0 replies; 149+ messages in thread From: YiFei Zhu @ 2020-09-25 1:35 UTC (permalink / raw) To: Kees Cook Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Thu, Sep 24, 2020 at 6:47 PM Kees Cook <keescook@chromium.org> wrote: > BTW, did this benchmark tool's results match your expectations from what > you saw with your RFC? (I assume it helped since you've included in > here.) Yes, I updated the commit message with the benchmarks of this patch series. Though, given that I'm running in a qemu-kvm on my laptop that has a lot of stuffs running on it (and with the cursed ThinkPad T480 CPU throttling), I had to throw much more syscalls at it to pass the "approximately equals" expectation... though no idea about what's going on with 732 vs 737. Or if you mean if I expected these results, yes. YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
* [PATCH v2 seccomp 6/6] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-09-24 12:44 ` [PATCH v2 seccomp 0/6] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls YiFei Zhu ` (4 preceding siblings ...) 2020-09-24 12:44 ` [PATCH v2 seccomp 5/6] selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu @ 2020-09-24 12:44 ` YiFei Zhu 2020-09-24 23:56 ` Kees Cook 5 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-09-24 12:44 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: YiFei Zhu <yifeifz2@illinois.edu> Currently the kernel does not provide an infrastructure to translate architecture numbers to a human-readable name. Translating syscall numbers to syscall names is possible through FTRACE_SYSCALL infrastructure but it does not provide support for compat syscalls. This will create a file for each PID as /proc/pid/seccomp_cache. The file will be empty when no seccomp filters are loaded, or be in the format of: <hex arch number> <decimal syscall number> <ALLOW | FILTER> where ALLOW means the cache is guaranteed to allow the syscall, and FILTER means the cache will pass the syscall to the BPF filter. For the docker default profile on x86_64 it looks like: c000003e 0 ALLOW c000003e 1 ALLOW c000003e 2 ALLOW c000003e 3 ALLOW [...] c000003e 132 ALLOW c000003e 133 ALLOW c000003e 134 FILTER c000003e 135 FILTER c000003e 136 FILTER c000003e 137 ALLOW c000003e 138 ALLOW c000003e 139 FILTER c000003e 140 ALLOW c000003e 141 ALLOW [...] This file is guarded by CONFIG_PROC_SECCOMP_CACHE with a default of N because I think certain users of seccomp might not want the application to know which syscalls are definitely usable. 
I'm not sure if adding all the "human readable names" is worthwhile, considering it can be easily done in userspace. Suggested-by: Jann Horn <jannh@google.com> Link: https://lore.kernel.org/lkml/CAG48ez3Ofqp4crXGksLmZY6=fGrF_tWyUCg7PBkAetvbbOPeOA@mail.gmail.com/ Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> --- arch/Kconfig | 10 ++++++++++ fs/proc/base.c | 7 +++++-- include/linux/seccomp.h | 5 +++++ kernel/seccomp.c | 26 ++++++++++++++++++++++++++ 4 files changed, 46 insertions(+), 2 deletions(-) diff --git a/arch/Kconfig b/arch/Kconfig index 8cc3dc87f253..dbfd897e5dc0 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -514,6 +514,16 @@ config SECCOMP_CACHE_NR_ONLY endchoice +config PROC_SECCOMP_CACHE + bool "Show seccomp filter cache status in /proc/pid/seccomp_cache" + depends on SECCOMP_CACHE_NR_ONLY + depends on PROC_FS + help + This is enables /proc/pid/seccomp_cache interface to monitor + seccomp cache data. The file format is subject to change. + + If unsure, say N. + config HAVE_ARCH_STACKLEAK bool help diff --git a/fs/proc/base.c b/fs/proc/base.c index 617db4e0faa0..2af626f69fa1 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -2615,7 +2615,7 @@ static struct dentry *proc_pident_instantiate(struct dentry *dentry, return d_splice_alias(inode, dentry); } -static struct dentry *proc_pident_lookup(struct inode *dir, +static struct dentry *proc_pident_lookup(struct inode *dir, struct dentry *dentry, const struct pid_entry *p, const struct pid_entry *end) @@ -2815,7 +2815,7 @@ static const struct pid_entry attr_dir_stuff[] = { static int proc_attr_dir_readdir(struct file *file, struct dir_context *ctx) { - return proc_pident_readdir(file, ctx, + return proc_pident_readdir(file, ctx, attr_dir_stuff, ARRAY_SIZE(attr_dir_stuff)); } @@ -3258,6 +3258,9 @@ static const struct pid_entry tgid_base_stuff[] = { #ifdef CONFIG_PROC_PID_ARCH_STATUS ONE("arch_status", S_IRUGO, proc_pid_arch_status), #endif +#ifdef CONFIG_PROC_SECCOMP_CACHE + ONE("seccomp_cache", S_IRUSR, 
proc_pid_seccomp_cache), +#endif }; static int proc_tgid_base_readdir(struct file *file, struct dir_context *ctx) diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h index 02aef2844c38..3cedec824365 100644 --- a/include/linux/seccomp.h +++ b/include/linux/seccomp.h @@ -121,4 +121,9 @@ static inline long seccomp_get_metadata(struct task_struct *task, return -EINVAL; } #endif /* CONFIG_SECCOMP_FILTER && CONFIG_CHECKPOINT_RESTORE */ + +#ifdef CONFIG_PROC_SECCOMP_CACHE +int proc_pid_seccomp_cache(struct seq_file *m, struct pid_namespace *ns, + struct pid *pid, struct task_struct *task); +#endif #endif /* _LINUX_SECCOMP_H */ diff --git a/kernel/seccomp.c b/kernel/seccomp.c index ac0266b6d18a..d97ec1876b4e 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -2293,3 +2293,29 @@ static int __init seccomp_sysctl_init(void) device_initcall(seccomp_sysctl_init) #endif /* CONFIG_SYSCTL */ + +#ifdef CONFIG_PROC_SECCOMP_CACHE +/* Currently CONFIG_PROC_SECCOMP_CACHE implies CONFIG_SECCOMP_CACHE_NR_ONLY */ +int proc_pid_seccomp_cache(struct seq_file *m, struct pid_namespace *ns, + struct pid *pid, struct task_struct *task) +{ + struct seccomp_filter *f = READ_ONCE(task->seccomp.filter); + int arch, nr; + + if (!f) + return 0; + + for (arch = 0; arch < ARRAY_SIZE(syscall_arches); arch++) { + for (nr = 0; nr < NR_syscalls; nr++) { + bool cached = test_bit(nr, f->cache.syscall_ok[arch]); + char *status = cached ? "ALLOW" : "FILTER"; + + seq_printf(m, "%08x %d %s\n", syscall_arches[arch], + nr, status + ); + } + } + + return 0; +} +#endif /* CONFIG_PROC_SECCOMP_CACHE */ -- 2.28.0 ^ permalink raw reply related [flat|nested] 149+ messages in thread
* Re: [PATCH v2 seccomp 6/6] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-09-24 12:44 ` [PATCH v2 seccomp 6/6] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu @ 2020-09-24 23:56 ` Kees Cook 2020-09-25 3:11 ` YiFei Zhu 0 siblings, 1 reply; 149+ messages in thread From: Kees Cook @ 2020-09-24 23:56 UTC (permalink / raw) To: YiFei Zhu Cc: containers, YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Thu, Sep 24, 2020 at 07:44:21AM -0500, YiFei Zhu wrote: > From: YiFei Zhu <yifeifz2@illinois.edu> > > Currently the kernel does not provide an infrastructure to translate > architecture numbers to a human-readable name. Translating syscall > numbers to syscall names is possible through FTRACE_SYSCALL > infrastructure but it does not provide support for compat syscalls. > > This will create a file for each PID as /proc/pid/seccomp_cache. > The file will be empty when no seccomp filters are loaded, or be > in the format of: > <hex arch number> <decimal syscall number> <ALLOW | FILTER> > where ALLOW means the cache is guaranteed to allow the syscall, > and filter means the cache will pass the syscall to the BPF filter. > > For the docker default profile on x86_64 it looks like: > c000003e 0 ALLOW > c000003e 1 ALLOW > c000003e 2 ALLOW > c000003e 3 ALLOW > [...] > c000003e 132 ALLOW > c000003e 133 ALLOW > c000003e 134 FILTER > c000003e 135 FILTER > c000003e 136 FILTER > c000003e 137 ALLOW > c000003e 138 ALLOW > c000003e 139 FILTER > c000003e 140 ALLOW > c000003e 141 ALLOW > [...] > > This file is guarded by CONFIG_PROC_SECCOMP_CACHE with a default > of N because I think certain users of seecomp might not want the > application to know which syscalls are definitely usable. 
> > I'm not sure if adding all the "human readable names" is worthwhile, > considering it can be easily done in userspace. The question of permissions is my central concern here: who should see this? Some contained processes have been intentionally blocked from self-introspection so even the "standard" high bar of "ptrace attach allowed?" can't always be sufficient. My compromise about filter visibility in the past was saying that CAP_SYS_ADMIN was required (see seccomp_get_filter()). I'm nervous to weaken this. (There is some work that hasn't been sent upstream yet that is looking to expose the filter _contents_ via /proc that has been nervous too.) Now full contents vs "allow"/"filter" are certainly different things, but I don't feel like I've got enough evidence to show that this introspection would help debugging enough to justify the partially imagined safety of not exposing it to potential attackers. I suspect it _is_ the right thing to do (just look at my own RFC's "debug" patch), but I'd like this to be well justified in the commit log. And yes, while it does hide behind a CONFIG, I'd still want it justified, especially since distros have a tendency to just turn everything on anyway. ;) > + for (arch = 0; arch < ARRAY_SIZE(syscall_arches); arch++) { > + for (nr = 0; nr < NR_syscalls; nr++) { > + bool cached = test_bit(nr, f->cache.syscall_ok[arch]); > + char *status = cached ? "ALLOW" : "FILTER"; > + > + seq_printf(m, "%08x %d %s\n", syscall_arches[arch], > + nr, status > + ); > + } > + } But behavior-wise, yeah, I like it; I'm fine with human-readable and full AUDIT_ARCH values. (Though, as devil's advocate again, to repeat Jann's own words back: do we want to add this only to have a new UAPI to support going forward?) -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v2 seccomp 6/6] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-09-24 23:56 ` Kees Cook @ 2020-09-25 3:11 ` YiFei Zhu 2020-09-25 3:26 ` Kees Cook 0 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-09-25 3:11 UTC (permalink / raw) To: Kees Cook Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Thu, Sep 24, 2020 at 6:56 PM Kees Cook <keescook@chromium.org> wrote: > > This file is guarded by CONFIG_PROC_SECCOMP_CACHE with a default > The question of permissions is my central concern here: who should see > this? Some contained processes have been intentionally blocked from > self-introspection so even the "standard" high bar of "ptrace attach > allowed?" can't always be sufficient. > > My compromise about filter visibility in the past was saying that > CAP_SYS_ADMIN was required (see seccomp_get_filter()). I'm nervous to > weaken this. (There is some work that hasn't been sent upstream yet that > is looking to expose the filter _contents_ via /proc that has been > nervous too.) > > Now full contents vs "allow"/"filter" are certainly different things, > but I don't feel like I've got enough evidence to show that this > introspection would help debugging enough to justify the partially > imagined safety of not exposing it to potential attackers. Agreed. I'm inclined to make it CONFIG_DEBUG_SECCOMP_CACHE and guarded by a CAP just to make it "debug only". > I suspect it _is_ the right thing to do (just look at my own RFC's > "debug" patch), but I'd like this to be well justified in the commit > log. > > And yes, while it does hide behind a CONFIG, I'd still want it justified, > especially since distros have a tendency to just turn everything on > anyway. 
;) Is there something to stop a config from being enabled in an allyesconfig? I remember seeing something like that. Else if someone is manually selecting we can add a help text with a big banner... > But behavior-wise, yeah, I like it; I'm fine with human-readable and > full AUDIT_ARCH values. (Though, as devil's advocate again, to repeat > Jann's own words back: do we want to add this only to have a new UAPI to > support going forward?) Is this something we want to keep stable? YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v2 seccomp 6/6] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-09-25 3:11 ` YiFei Zhu @ 2020-09-25 3:26 ` Kees Cook 0 siblings, 0 replies; 149+ messages in thread From: Kees Cook @ 2020-09-25 3:26 UTC (permalink / raw) To: YiFei Zhu Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Thu, Sep 24, 2020 at 10:11:17PM -0500, YiFei Zhu wrote: > On Thu, Sep 24, 2020 at 6:56 PM Kees Cook <keescook@chromium.org> wrote: > > > This file is guarded by CONFIG_PROC_SECCOMP_CACHE with a default > > The question of permissions is my central concern here: who should see > > this? Some contained processes have been intentionally blocked from > > self-introspection so even the "standard" high bar of "ptrace attach > > allowed?" can't always be sufficient. > > > > My compromise about filter visibility in the past was saying that > > CAP_SYS_ADMIN was required (see seccomp_get_filter()). I'm nervous to > > weaken this. (There is some work that hasn't been sent upstream yet that > > is looking to expose the filter _contents_ via /proc that has been > > nervous too.) > > > > Now full contents vs "allow"/"filter" are certainly different things, > > but I don't feel like I've got enough evidence to show that this > > introspection would help debugging enough to justify the partially > > imagined safety of not exposing it to potential attackers. > > Agreed. I'm inclined to make it CONFIG_DEBUG_SECCOMP_CACHE and guarded > by a CAP just to make it "debug only". Yeah; I just can't quite see what the best direction is here. I will ponder this more. As I mentioned, it does seem handy. :) > Is there something to stop a config from being enabled in an > allyesconfig? I remember seeing something like that. 
Else if someone > is manually selecting we can add a help text with a big banner... Yeah, allyesconfig and allmodconfig both effectively set CONFIG_COMPILE_TEST. Anyway, likely a caps test will end up being the way to do it. > > > But behavior-wise, yeah, I like it; I'm fine with human-readable and > > full AUDIT_ARCH values. (Though, as devil's advocate again, to repeat > > Jann's own words back: do we want to add this only to have a new UAPI to > > support going forward?) > > Is this something we want to keep stable? The Prime Directive of "never break userspace" is really "never break userspace in a way that someone notices". So if nothing ever parses that file, then we don't have to keep it stable, but if something does, and we change it, we have to fix it. So, a capability test means very few things will touch it, and if we decide it's not a big deal, we can relax permissions in the future. -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
* [PATCH v3 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results 2020-09-21 5:35 [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls YiFei Zhu ` (7 preceding siblings ...) 2020-09-24 12:06 ` [PATCH seccomp 0/6] " YiFei Zhu @ 2020-09-30 15:19 ` YiFei Zhu 2020-09-30 15:19 ` [PATCH v3 seccomp 1/5] x86: Enable seccomp architecture tracking YiFei Zhu ` (5 more replies) 8 siblings, 6 replies; 149+ messages in thread From: YiFei Zhu @ 2020-09-30 15:19 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: YiFei Zhu <yifeifz2@illinois.edu> Alternative: https://lore.kernel.org/lkml/20200923232923.3142503-1-keescook@chromium.org/T/ Major differences from the linked alternative by Kees: * No x32 special-case handling -- not worth the complexity * No caching of denylist -- not worth the complexity * No seccomp arch pinning -- I think this is an independent feature * The bitmaps are part of the filters rather than the task. * Architectures supported by default through arch number array, except for MIPS with its sparse syscall numbers. * Configurable per-build for future different cache modes. This series adds a bitmap to cache seccomp filter results if the result permits a syscall and is independent of syscall arguments. This visibly decreases seccomp overhead for most common seccomp filters with very little memory footprint. The overhead of running Seccomp filters has been part of some past discussions [1][2][3]. Oftentimes, the filters have a large number of instructions that check syscall numbers one by one and jump based on that. Some users chain BPF filters which further enlarge the overhead. 
A recent work [6] comprehensively measures the Seccomp overhead and shows that the overhead is non-negligible and has a non-trivial impact on application performance. We observed some common filters, such as docker's [4] or systemd's [5], will make most decisions based only on the syscall numbers, and as past discussions considered, a bitmap where each bit represents a syscall makes most sense for these filters. In order to build this bitmap at filter attach time, each filter is emulated for every syscall (under each possible architecture), and checked for any accesses of struct seccomp_data that are not the "arch" nor "nr" (syscall) members. If only "arch" and "nr" are examined, and the program returns allow, then we can be sure that the filter must return allow independent from syscall arguments. When it is concluded that an allow must occur for the given architecture and syscall pair, seccomp will immediately allow the syscall, bypassing further BPF execution. Ongoing work is to further support arguments with fast hash table lookups. We are investigating the performance of doing so [6], and how to best integrate with the existing seccomp infrastructure. Some benchmarks are performed with results in patch 5, copied below: Current BPF sysctl settings: net.core.bpf_jit_enable = 1 net.core.bpf_jit_harden = 0 Benchmarking 200000000 syscalls... 
129.359381409 - 0.008724424 = 129350656985 (129.4s) getpid native: 646 ns 264.385890006 - 129.360453229 = 135025436777 (135.0s) getpid RET_ALLOW 1 filter (bitmap): 675 ns 399.400511893 - 264.387045901 = 135013465992 (135.0s) getpid RET_ALLOW 2 filters (bitmap): 675 ns 545.872866260 - 399.401718327 = 146471147933 (146.5s) getpid RET_ALLOW 3 filters (full): 732 ns 696.337101319 - 545.874097681 = 150463003638 (150.5s) getpid RET_ALLOW 4 filters (full): 752 ns Estimated total seccomp overhead for 1 bitmapped filter: 29 ns Estimated total seccomp overhead for 2 bitmapped filters: 29 ns Estimated total seccomp overhead for 3 full filters: 86 ns Estimated total seccomp overhead for 4 full filters: 106 ns Estimated seccomp entry overhead: 29 ns Estimated seccomp per-filter overhead (last 2 diff): 20 ns Estimated seccomp per-filter overhead (filters / 4): 19 ns Expectations: native ≤ 1 bitmap (646 ≤ 675): ✔️ native ≤ 1 filter (646 ≤ 732): ✔️ per-filter (last 2 diff) ≈ per-filter (filters / 4) (20 ≈ 19): ✔️ 1 bitmapped ≈ 2 bitmapped (29 ≈ 29): ✔️ entry ≈ 1 bitmapped (29 ≈ 29): ✔️ entry ≈ 2 bitmapped (29 ≈ 29): ✔️ native + entry + (per filter * 4) ≈ 4 filters total (755 ≈ 752): ✔️ v2 -> v3: * Added array_index_nospec guards * No more syscall_arches[] array and expecting on loop unrolling. Arches are configured with per-arch seccomp.h. * Moved filter emulation to attach time (from prepare time). * Further simplified emulator, basing on Kees's code. * Guard /proc/pid/seccomp_cache with CAP_SYS_ADMIN. v1 -> v2: * Corrected one outdated function documentation. RFC -> v1: * Config made on by default across all arches that could support it. * Added arch numbers array and emulate filter for each arch number, and have a per-arch bitmap. * Massively simplified the emulator so it would only support the common instructions in Kees's list. * Fixed inheriting bitmap across filters (filter->prev is always NULL during prepare). * Stole the selftest from Kees. 
* Added a /proc/pid/seccomp_cache by Jann's suggestion. Patch 1 adds the arch macros for x86. Patch 2 implements the emulator that finds if a filter must return allow, Patch 3 implements the test_bit against the bitmaps. Patch 4 updates the selftest to better show the new semantics. Patch 5 implements /proc/pid/seccomp_cache. [1] https://lore.kernel.org/linux-security-module/c22a6c3cefc2412cad00ae14c1371711@huawei.com/T/ [2] https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/T/ [3] https://github.com/seccomp/libseccomp/issues/116 [4] https://github.com/moby/moby/blob/ae0ef82b90356ac613f329a8ef5ee42ca923417d/profiles/seccomp/default.json [5] https://github.com/systemd/systemd/blob/6743a1caf4037f03dc51a1277855018e4ab61957/src/shared/seccomp-util.c#L270 [6] Draco: Architectural and Operating System Support for System Call Security https://tianyin.github.io/pub/draco.pdf, MICRO-53, Oct. 2020 Kees Cook (2): x86: Enable seccomp architecture tracking selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu (3): seccomp/cache: Add "emulator" to check if filter is constant allow seccomp/cache: Lookup syscall allowlist for fast path seccomp/cache: Report cache data through /proc/pid/seccomp_cache arch/Kconfig | 49 ++++ arch/x86/Kconfig | 1 + arch/x86/include/asm/seccomp.h | 15 + fs/proc/base.c | 3 + include/linux/seccomp.h | 5 + kernel/seccomp.c | 265 +++++++++++++++++- .../selftests/seccomp/seccomp_benchmark.c | 151 ++++++++-- tools/testing/selftests/seccomp/settings | 2 +- 8 files changed, 467 insertions(+), 24 deletions(-) -- 2.28.0 ^ permalink raw reply [flat|nested] 149+ messages in thread
* [PATCH v3 seccomp 1/5] x86: Enable seccomp architecture tracking 2020-09-30 15:19 ` [PATCH v3 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu @ 2020-09-30 15:19 ` YiFei Zhu 2020-09-30 21:21 ` Kees Cook 2020-09-30 15:19 ` [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow YiFei Zhu ` (4 subsequent siblings) 5 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-09-30 15:19 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: Kees Cook <keescook@chromium.org> Provide seccomp internals with the details to calculate which syscall table the running kernel is expecting to deal with. This allows for efficient architecture pinning and paves the way for constant-action bitmaps. 
Signed-off-by: Kees Cook <keescook@chromium.org> [YiFei: Removed x32, added macro for nr_syscalls] Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> --- arch/x86/include/asm/seccomp.h | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/arch/x86/include/asm/seccomp.h b/arch/x86/include/asm/seccomp.h index 2bd1338de236..7b3a58271656 100644 --- a/arch/x86/include/asm/seccomp.h +++ b/arch/x86/include/asm/seccomp.h @@ -16,6 +16,18 @@ #define __NR_seccomp_sigreturn_32 __NR_ia32_sigreturn #endif +#ifdef CONFIG_X86_64 +# define SECCOMP_ARCH_DEFAULT AUDIT_ARCH_X86_64 +# define SECCOMP_ARCH_DEFAULT_NR NR_syscalls +# ifdef CONFIG_COMPAT +# define SECCOMP_ARCH_COMPAT AUDIT_ARCH_I386 +# define SECCOMP_ARCH_COMPAT_NR IA32_NR_syscalls +# endif +#else /* !CONFIG_X86_64 */ +# define SECCOMP_ARCH_DEFAULT AUDIT_ARCH_I386 +# define SECCOMP_ARCH_DEFAULT_NR NR_syscalls +#endif + #include <asm-generic/seccomp.h> #endif /* _ASM_X86_SECCOMP_H */ -- 2.28.0 ^ permalink raw reply related [flat|nested] 149+ messages in thread
* Re: [PATCH v3 seccomp 1/5] x86: Enable seccomp architecture tracking 2020-09-30 15:19 ` [PATCH v3 seccomp 1/5] x86: Enable seccomp architecture tracking YiFei Zhu @ 2020-09-30 21:21 ` Kees Cook 2020-09-30 21:33 ` Jann Horn 0 siblings, 1 reply; 149+ messages in thread From: Kees Cook @ 2020-09-30 21:21 UTC (permalink / raw) To: YiFei Zhu Cc: containers, YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Wed, Sep 30, 2020 at 10:19:12AM -0500, YiFei Zhu wrote: > From: Kees Cook <keescook@chromium.org> > > Provide seccomp internals with the details to calculate which syscall > table the running kernel is expecting to deal with. This allows for > efficient architecture pinning and paves the way for constant-action > bitmaps. > > Signed-off-by: Kees Cook <keescook@chromium.org> > [YiFei: Removed x32, added macro for nr_syscalls] > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> > --- > arch/x86/include/asm/seccomp.h | 12 ++++++++++++ > 1 file changed, 12 insertions(+) > > diff --git a/arch/x86/include/asm/seccomp.h b/arch/x86/include/asm/seccomp.h > index 2bd1338de236..7b3a58271656 100644 > --- a/arch/x86/include/asm/seccomp.h > +++ b/arch/x86/include/asm/seccomp.h > @@ -16,6 +16,18 @@ > #define __NR_seccomp_sigreturn_32 __NR_ia32_sigreturn > #endif > > +#ifdef CONFIG_X86_64 > +# define SECCOMP_ARCH_DEFAULT AUDIT_ARCH_X86_64 > +# define SECCOMP_ARCH_DEFAULT_NR NR_syscalls bikeshedding: let's call these SECCOMP_ARCH_NATIVE* -- I think it's more descriptive. 
> +# ifdef CONFIG_COMPAT > +# define SECCOMP_ARCH_COMPAT AUDIT_ARCH_I386 > +# define SECCOMP_ARCH_COMPAT_NR IA32_NR_syscalls > +# endif > +#else /* !CONFIG_X86_64 */ > +# define SECCOMP_ARCH_DEFAULT AUDIT_ARCH_I386 > +# define SECCOMP_ARCH_DEFAULT_NR NR_syscalls > +#endif > + > #include <asm-generic/seccomp.h> > > #endif /* _ASM_X86_SECCOMP_H */ > -- > 2.28.0 > But otherwise, yes, looks good to me. For this patch, I think the S-o-b chain is probably more accurately captured as: Signed-off-by: Kees Cook <keescook@chromium.org> Co-developed-by: YiFei Zhu <yifeifz2@illinois.edu> Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v3 seccomp 1/5] x86: Enable seccomp architecture tracking 2020-09-30 21:21 ` Kees Cook @ 2020-09-30 21:33 ` Jann Horn 2020-09-30 22:53 ` Kees Cook 0 siblings, 1 reply; 149+ messages in thread From: Jann Horn @ 2020-09-30 21:33 UTC (permalink / raw) To: Kees Cook Cc: YiFei Zhu, Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Wed, Sep 30, 2020 at 11:21 PM Kees Cook <keescook@chromium.org> wrote: > On Wed, Sep 30, 2020 at 10:19:12AM -0500, YiFei Zhu wrote: > > From: Kees Cook <keescook@chromium.org> > > > > Provide seccomp internals with the details to calculate which syscall > > table the running kernel is expecting to deal with. This allows for > > efficient architecture pinning and paves the way for constant-action > > bitmaps. > > > > Signed-off-by: Kees Cook <keescook@chromium.org> > > [YiFei: Removed x32, added macro for nr_syscalls] > > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> [...] > But otherwise, yes, looks good to me. For this patch, I think the S-o-b chain is probably more > accurately captured as: > > Signed-off-by: Kees Cook <keescook@chromium.org> > Co-developed-by: YiFei Zhu <yifeifz2@illinois.edu> > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> (Technically, https://www.kernel.org/doc/html/latest/process/submitting-patches.html#when-to-use-acked-by-cc-and-co-developed-by says that "every Co-developed-by: must be immediately followed by a Signed-off-by: of the associated co-author" (and has an example of how that should look).) ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v3 seccomp 1/5] x86: Enable seccomp architecture tracking 2020-09-30 21:33 ` Jann Horn @ 2020-09-30 22:53 ` Kees Cook 2020-09-30 23:15 ` Jann Horn 0 siblings, 1 reply; 149+ messages in thread From: Kees Cook @ 2020-09-30 22:53 UTC (permalink / raw) To: Jann Horn Cc: YiFei Zhu, Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Wed, Sep 30, 2020 at 11:33:15PM +0200, Jann Horn wrote: > On Wed, Sep 30, 2020 at 11:21 PM Kees Cook <keescook@chromium.org> wrote: > > On Wed, Sep 30, 2020 at 10:19:12AM -0500, YiFei Zhu wrote: > > > From: Kees Cook <keescook@chromium.org> > > > > > > Provide seccomp internals with the details to calculate which syscall > > > table the running kernel is expecting to deal with. This allows for > > > efficient architecture pinning and paves the way for constant-action > > > bitmaps. > > > > > > Signed-off-by: Kees Cook <keescook@chromium.org> > > > [YiFei: Removed x32, added macro for nr_syscalls] > > > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> > [...] > > But otherwise, yes, looks good to me. For this patch, I think the S-o-b chain is probably more > > accurately captured as: > > > > Signed-off-by: Kees Cook <keescook@chromium.org> > > Co-developed-by: YiFei Zhu <yifeifz2@illinois.edu> > > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> > > (Technically, https://www.kernel.org/doc/html/latest/process/submitting-patches.html#when-to-use-acked-by-cc-and-co-developed-by > says that "every Co-developed-by: must be immediately followed by a > Signed-off-by: of the associated co-author" (and has an example of how > that should look).) 
Right, but it is not needed for the commit author (here, the From:), the second example given in the docs shows this: From: From Author <from@author.example.org> <changelog> Co-developed-by: Random Co-Author <random@coauthor.example.org> Signed-off-by: Random Co-Author <random@coauthor.example.org> Signed-off-by: From Author <from@author.example.org> Co-developed-by: Submitting Co-Author <sub@coauthor.example.org> Signed-off-by: Submitting Co-Author <sub@coauthor.example.org> and there is no third co-developer, so it's: From: From Author <from@author.example.org> <changelog> Signed-off-by: From Author <from@author.example.org> Co-developed-by: Submitting Co-Author <sub@coauthor.example.org> Signed-off-by: Submitting Co-Author <sub@coauthor.example.org> If I'm the From, and YiFei Zhu is the submitting co-developer, then it's: From: Kees Cook <keescook@chromium.org> <changelog> Signed-off-by: Kees Cook <keescook@chromium.org> Co-developed-by: YiFei Zhu <yifeifz2@illinois.edu> Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> which is what I suggested. -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v3 seccomp 1/5] x86: Enable seccomp architecture tracking 2020-09-30 22:53 ` Kees Cook @ 2020-09-30 23:15 ` Jann Horn 0 siblings, 0 replies; 149+ messages in thread From: Jann Horn @ 2020-09-30 23:15 UTC (permalink / raw) To: Kees Cook Cc: YiFei Zhu, Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Thu, Oct 1, 2020 at 12:53 AM Kees Cook <keescook@chromium.org> wrote: > > On Wed, Sep 30, 2020 at 11:33:15PM +0200, Jann Horn wrote: > > On Wed, Sep 30, 2020 at 11:21 PM Kees Cook <keescook@chromium.org> wrote: > > > On Wed, Sep 30, 2020 at 10:19:12AM -0500, YiFei Zhu wrote: > > > > From: Kees Cook <keescook@chromium.org> > > > > > > > > Provide seccomp internals with the details to calculate which syscall > > > > table the running kernel is expecting to deal with. This allows for > > > > efficient architecture pinning and paves the way for constant-action > > > > bitmaps. > > > > > > > > Signed-off-by: Kees Cook <keescook@chromium.org> > > > > [YiFei: Removed x32, added macro for nr_syscalls] > > > > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> > > [...] > > > But otherwise, yes, looks good to me. For this patch, I think the S-o-b chain is probably more > > > accurately captured as: > > > > > > Signed-off-by: Kees Cook <keescook@chromium.org> > > > Co-developed-by: YiFei Zhu <yifeifz2@illinois.edu> > > > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> > > > > (Technically, https://www.kernel.org/doc/html/latest/process/submitting-patches.html#when-to-use-acked-by-cc-and-co-developed-by > > says that "every Co-developed-by: must be immediately followed by a > > Signed-off-by: of the associated co-author" (and has an example of how > > that should look).) 
> > Right, but it is not needed for the commit author (here, the From:), > the second example given in the docs shows this: Aah, right. Nevermind, sorry for the noise. ^ permalink raw reply [flat|nested] 149+ messages in thread
* [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow 2020-09-30 15:19 ` [PATCH v3 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu 2020-09-30 15:19 ` [PATCH v3 seccomp 1/5] x86: Enable seccomp architecture tracking YiFei Zhu @ 2020-09-30 15:19 ` YiFei Zhu 2020-09-30 22:24 ` Jann Horn ` (2 more replies) 2020-09-30 15:19 ` [PATCH v3 seccomp 3/5] seccomp/cache: Lookup syscall allowlist for fast path YiFei Zhu ` (3 subsequent siblings) 5 siblings, 3 replies; 149+ messages in thread From: YiFei Zhu @ 2020-09-30 15:19 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: YiFei Zhu <yifeifz2@illinois.edu> SECCOMP_CACHE_NR_ONLY will only operate on syscalls that do not access any syscall arguments or instruction pointer. To facilitate this we need a static analyser to know whether a filter will return allow regardless of syscall arguments for a given architecture number / syscall number pair. This is implemented here with a pseudo-emulator, and stored in a per-filter bitmap. Each common BPF instruction is emulated. Any weirdness or loading from a syscall argument will cause the emulator to bail. The emulation is also halted if it reaches a return. In that case, if it returns an SECCOMP_RET_ALLOW, the syscall is marked as good. Emulator structure and comments are from Kees [1] and Jann [2]. Emulation is done at attach time. If a filter depends on previous filters, and a previous filter does not guarantee that the syscall is allowed, we skip the emulation of this syscall. 
[1] https://lore.kernel.org/lkml/20200923232923.3142503-5-keescook@chromium.org/ [2] https://lore.kernel.org/lkml/CAG48ez1p=dR_2ikKq=xVxkoGg0fYpTBpkhJSv1w-6BG=76PAvw@mail.gmail.com/ Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> --- arch/Kconfig | 34 ++++++++++ arch/x86/Kconfig | 1 + kernel/seccomp.c | 167 ++++++++++++++++++++++++++++++++++++++++++++++- 3 files changed, 201 insertions(+), 1 deletion(-) diff --git a/arch/Kconfig b/arch/Kconfig index 21a3675a7a3a..ca867b2a5d71 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -471,6 +471,14 @@ config HAVE_ARCH_SECCOMP_FILTER results in the system call being skipped immediately. - seccomp syscall wired up +config HAVE_ARCH_SECCOMP_CACHE_NR_ONLY + bool + help + An arch should select this symbol if it provides all of these things: + - all the requirements for HAVE_ARCH_SECCOMP_FILTER + - SECCOMP_ARCH_DEFAULT + - SECCOMP_ARCH_DEFAULT_NR + config SECCOMP prompt "Enable seccomp to safely execute untrusted bytecode" def_bool y @@ -498,6 +506,32 @@ config SECCOMP_FILTER See Documentation/userspace-api/seccomp_filter.rst for details. +choice + prompt "Seccomp filter cache" + default SECCOMP_CACHE_NONE + depends on SECCOMP_FILTER + depends on HAVE_ARCH_SECCOMP_CACHE_NR_ONLY + help + Seccomp filters can potentially incur large overhead for each + system call. This can alleviate some of the overhead. + + If in doubt, select 'syscall numbers only'. + +config SECCOMP_CACHE_NONE + bool "None" + help + No caching is done. Seccomp filters will be called each time + a system call occurs in a seccomp-guarded task. + +config SECCOMP_CACHE_NR_ONLY + bool "Syscall number only" + depends on HAVE_ARCH_SECCOMP_CACHE_NR_ONLY + help + For each syscall number, if the seccomp filter has a fixed + result, store that result in a bitmap to speed up system calls. 
+ +endchoice + config HAVE_ARCH_STACKLEAK bool help diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 1ab22869a765..ff5289228ea5 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -150,6 +150,7 @@ config X86 select HAVE_ARCH_COMPAT_MMAP_BASES if MMU && COMPAT select HAVE_ARCH_PREL32_RELOCATIONS select HAVE_ARCH_SECCOMP_FILTER + select HAVE_ARCH_SECCOMP_CACHE_NR_ONLY select HAVE_ARCH_THREAD_STRUCT_WHITELIST select HAVE_ARCH_STACKLEAK select HAVE_ARCH_TRACEHOOK diff --git a/kernel/seccomp.c b/kernel/seccomp.c index ae6b40cc39f4..f09c9e74ae05 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -143,6 +143,37 @@ struct notification { struct list_head notifications; }; +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY +/** + * struct seccomp_cache_filter_data - container for cache's per-filter data + * + * Tis struct is ordered to minimize padding holes. + * + * @syscall_allow_default: A bitmap where each bit represents whether the + * filter willalways allow the syscall, for the + * default architecture. + * @syscall_allow_compat: A bitmap where each bit represents whether the + * filter will always allow the syscall, for the + * compat architecture. + */ +struct seccomp_cache_filter_data { +#ifdef SECCOMP_ARCH_DEFAULT + DECLARE_BITMAP(syscall_allow_default, SECCOMP_ARCH_DEFAULT_NR); +#endif +#ifdef SECCOMP_ARCH_COMPAT + DECLARE_BITMAP(syscall_allow_compat, SECCOMP_ARCH_COMPAT_NR); +#endif +}; + +#define SECCOMP_EMU_MAX_PENDING_STATES 64 +#else +struct seccomp_cache_filter_data { }; + +static inline void seccomp_cache_prepare(struct seccomp_filter *sfilter) +{ +} +#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */ + /** * struct seccomp_filter - container for seccomp BPF programs * @@ -159,6 +190,7 @@ struct notification { * this filter after reaching 0. The @users count is always smaller * or equal to @refs. Hence, reaching 0 for @users does not mean * the filter can be freed. + * @cache: container for cache-related data. 
* @log: true if all actions except for SECCOMP_RET_ALLOW should be logged * @prev: points to a previously installed, or inherited, filter * @prog: the BPF program to evaluate @@ -180,6 +212,7 @@ struct seccomp_filter { refcount_t refs; refcount_t users; bool log; + struct seccomp_cache_filter_data cache; struct seccomp_filter *prev; struct bpf_prog *prog; struct notification *notif; @@ -544,7 +577,8 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog) { struct seccomp_filter *sfilter; int ret; - const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE); + const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE) || + IS_ENABLED(CONFIG_SECCOMP_CACHE_NR_ONLY); if (fprog->len == 0 || fprog->len > BPF_MAXINSNS) return ERR_PTR(-EINVAL); @@ -610,6 +644,136 @@ seccomp_prepare_user_filter(const char __user *user_filter) return filter; } +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY +/** + * seccomp_emu_is_const_allow - check if filter is constant allow with given data + * @fprog: The BPF programs + * @sd: The seccomp data to check against, only syscall number are arch + * number are considered constant. 
+ */ +static bool seccomp_emu_is_const_allow(struct sock_fprog_kern *fprog, + struct seccomp_data *sd) +{ + unsigned int insns; + unsigned int reg_value = 0; + unsigned int pc; + bool op_res; + + if (WARN_ON_ONCE(!fprog)) + return false; + + insns = bpf_classic_proglen(fprog); + for (pc = 0; pc < insns; pc++) { + struct sock_filter *insn = &fprog->filter[pc]; + u16 code = insn->code; + u32 k = insn->k; + + switch (code) { + case BPF_LD | BPF_W | BPF_ABS: + switch (k) { + case offsetof(struct seccomp_data, nr): + reg_value = sd->nr; + break; + case offsetof(struct seccomp_data, arch): + reg_value = sd->arch; + break; + default: + /* can't optimize (non-constant value load) */ + return false; + } + break; + case BPF_RET | BPF_K: + /* reached return with constant values only, check allow */ + return k == SECCOMP_RET_ALLOW; + case BPF_JMP | BPF_JA: + pc += insn->k; + break; + case BPF_JMP | BPF_JEQ | BPF_K: + case BPF_JMP | BPF_JGE | BPF_K: + case BPF_JMP | BPF_JGT | BPF_K: + case BPF_JMP | BPF_JSET | BPF_K: + switch (BPF_OP(code)) { + case BPF_JEQ: + op_res = reg_value == k; + break; + case BPF_JGE: + op_res = reg_value >= k; + break; + case BPF_JGT: + op_res = reg_value > k; + break; + case BPF_JSET: + op_res = !!(reg_value & k); + break; + default: + /* can't optimize (unknown jump) */ + return false; + } + + pc += op_res ? insn->jt : insn->jf; + break; + case BPF_ALU | BPF_AND | BPF_K: + reg_value &= k; + break; + default: + /* can't optimize (unknown insn) */ + return false; + } + } + + /* ran off the end of the filter?! 
*/ + WARN_ON(1); + return false; +} + +static void seccomp_cache_prepare_bitmap(struct seccomp_filter *sfilter, + void *bitmap, const void *bitmap_prev, + size_t bitmap_size, int arch) +{ + struct sock_fprog_kern *fprog = sfilter->prog->orig_prog; + struct seccomp_data sd; + int nr; + + for (nr = 0; nr < bitmap_size; nr++) { + if (bitmap_prev && !test_bit(nr, bitmap_prev)) + continue; + + sd.nr = nr; + sd.arch = arch; + + if (seccomp_emu_is_const_allow(fprog, &sd)) + set_bit(nr, bitmap); + } +} + +/** + * seccomp_cache_prepare - emulate the filter to find cachable syscalls + * @sfilter: The seccomp filter + * + * Returns 0 if successful or -errno if error occurred. + */ +static void seccomp_cache_prepare(struct seccomp_filter *sfilter) +{ + struct seccomp_cache_filter_data *cache = &sfilter->cache; + const struct seccomp_cache_filter_data *cache_prev = + sfilter->prev ? &sfilter->prev->cache : NULL; + +#ifdef SECCOMP_ARCH_DEFAULT + seccomp_cache_prepare_bitmap(sfilter, cache->syscall_allow_default, + cache_prev ? cache_prev->syscall_allow_default : NULL, + SECCOMP_ARCH_DEFAULT_NR, + SECCOMP_ARCH_DEFAULT); +#endif /* SECCOMP_ARCH_DEFAULT */ + +#ifdef SECCOMP_ARCH_COMPAT + seccomp_cache_prepare_bitmap(sfilter, cache->syscall_allow_compat, + cache_prev ? cache_prev->syscall_allow_compat : NULL, + SECCOMP_ARCH_COMPAT_NR, + SECCOMP_ARCH_COMPAT); +#endif /* SECCOMP_ARCH_COMPAT */ +} +#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */ + /** * seccomp_attach_filter: validate and attach filter * @flags: flags to change filter behavior @@ -659,6 +823,7 @@ static long seccomp_attach_filter(unsigned int flags, * task reference. */ filter->prev = current->seccomp.filter; + seccomp_cache_prepare(filter); current->seccomp.filter = filter; atomic_inc(¤t->seccomp.filter_count); -- 2.28.0 ^ permalink raw reply related [flat|nested] 149+ messages in thread
* Re: [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow 2020-09-30 15:19 ` [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow YiFei Zhu @ 2020-09-30 22:24 ` Jann Horn 2020-09-30 22:49 ` Kees Cook 2020-10-01 11:28 ` YiFei Zhu 2020-09-30 22:40 ` Kees Cook 2020-10-09 4:47 ` YiFei Zhu 2 siblings, 2 replies; 149+ messages in thread From: Jann Horn @ 2020-09-30 22:24 UTC (permalink / raw) To: YiFei Zhu Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Wed, Sep 30, 2020 at 5:20 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > SECCOMP_CACHE_NR_ONLY will only operate on syscalls that do not > access any syscall arguments or instruction pointer. To facilitate > this we need a static analyser to know whether a filter will > return allow regardless of syscall arguments for a given > architecture number / syscall number pair. This is implemented > here with a pseudo-emulator, and stored in a per-filter bitmap. > > Each common BPF instruction are emulated. Any weirdness or loading > from a syscall argument will cause the emulator to bail. > > The emulation is also halted if it reaches a return. In that case, > if it returns an SECCOMP_RET_ALLOW, the syscall is marked as good. > > Emulator structure and comments are from Kees [1] and Jann [2]. > > Emulation is done at attach time. If a filter depends on more > filters, and if the dependee does not guarantee to allow the > syscall, then we skip the emulation of this syscall. > > [1] https://lore.kernel.org/lkml/20200923232923.3142503-5-keescook@chromium.org/ > [2] https://lore.kernel.org/lkml/CAG48ez1p=dR_2ikKq=xVxkoGg0fYpTBpkhJSv1w-6BG=76PAvw@mail.gmail.com/ [...] 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig > index 1ab22869a765..ff5289228ea5 100644 > --- a/arch/x86/Kconfig > +++ b/arch/x86/Kconfig > @@ -150,6 +150,7 @@ config X86 > select HAVE_ARCH_COMPAT_MMAP_BASES if MMU && COMPAT > select HAVE_ARCH_PREL32_RELOCATIONS > select HAVE_ARCH_SECCOMP_FILTER > + select HAVE_ARCH_SECCOMP_CACHE_NR_ONLY > select HAVE_ARCH_THREAD_STRUCT_WHITELIST > select HAVE_ARCH_STACKLEAK > select HAVE_ARCH_TRACEHOOK If you did the architecture enablement for X86 later in the series, you could move this part over into that patch, that'd be cleaner. > diff --git a/kernel/seccomp.c b/kernel/seccomp.c > index ae6b40cc39f4..f09c9e74ae05 100644 > --- a/kernel/seccomp.c > +++ b/kernel/seccomp.c > @@ -143,6 +143,37 @@ struct notification { > struct list_head notifications; > }; > > +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY > +/** > + * struct seccomp_cache_filter_data - container for cache's per-filter data > + * > + * Tis struct is ordered to minimize padding holes. I think this comment can probably go away, there isn't really much trickery around padding holes in the struct as it is now. > + * @syscall_allow_default: A bitmap where each bit represents whether the > + * filter willalways allow the syscall, for the nit: s/willalways/will always/ [...] > +static void seccomp_cache_prepare_bitmap(struct seccomp_filter *sfilter, > + void *bitmap, const void *bitmap_prev, > + size_t bitmap_size, int arch) > +{ > + struct sock_fprog_kern *fprog = sfilter->prog->orig_prog; > + struct seccomp_data sd; > + int nr; > + > + for (nr = 0; nr < bitmap_size; nr++) { > + if (bitmap_prev && !test_bit(nr, bitmap_prev)) > + continue; > + > + sd.nr = nr; > + sd.arch = arch; > + > + if (seccomp_emu_is_const_allow(fprog, &sd)) > + set_bit(nr, bitmap); set_bit() is atomic, but since we only do this at filter setup, before the filter becomes globally visible, we don't need atomicity here. So this should probably use __set_bit() instead. 
^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow 2020-09-30 22:24 ` Jann Horn @ 2020-09-30 22:49 ` Kees Cook 2020-10-01 11:28 ` YiFei Zhu 1 sibling, 0 replies; 149+ messages in thread From: Kees Cook @ 2020-09-30 22:49 UTC (permalink / raw) To: Jann Horn Cc: YiFei Zhu, Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Thu, Oct 01, 2020 at 12:24:32AM +0200, Jann Horn wrote: > On Wed, Sep 30, 2020 at 5:20 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > > SECCOMP_CACHE_NR_ONLY will only operate on syscalls that do not > > access any syscall arguments or instruction pointer. To facilitate > > this we need a static analyser to know whether a filter will > > return allow regardless of syscall arguments for a given > > architecture number / syscall number pair. This is implemented > > here with a pseudo-emulator, and stored in a per-filter bitmap. > > > > Each common BPF instruction are emulated. Any weirdness or loading > > from a syscall argument will cause the emulator to bail. > > > > The emulation is also halted if it reaches a return. In that case, > > if it returns an SECCOMP_RET_ALLOW, the syscall is marked as good. > > > > Emulator structure and comments are from Kees [1] and Jann [2]. > > > > Emulation is done at attach time. If a filter depends on more > > filters, and if the dependee does not guarantee to allow the > > syscall, then we skip the emulation of this syscall. > > > > [1] https://lore.kernel.org/lkml/20200923232923.3142503-5-keescook@chromium.org/ > > [2] https://lore.kernel.org/lkml/CAG48ez1p=dR_2ikKq=xVxkoGg0fYpTBpkhJSv1w-6BG=76PAvw@mail.gmail.com/ > [...] 
> > +static void seccomp_cache_prepare_bitmap(struct seccomp_filter *sfilter, > > + void *bitmap, const void *bitmap_prev, > > + size_t bitmap_size, int arch) > > +{ > > + struct sock_fprog_kern *fprog = sfilter->prog->orig_prog; > > + struct seccomp_data sd; > > + int nr; > > + > > + for (nr = 0; nr < bitmap_size; nr++) { > > + if (bitmap_prev && !test_bit(nr, bitmap_prev)) > > + continue; > > + > > + sd.nr = nr; > > + sd.arch = arch; > > + > > + if (seccomp_emu_is_const_allow(fprog, &sd)) > > + set_bit(nr, bitmap); > > set_bit() is atomic, but since we only do this at filter setup, before > the filter becomes globally visible, we don't need atomicity here. So > this should probably use __set_bit() instead. Oh yes, excellent point! That will speed this up a bit. When you do this, please include a comment here describing why its safe to do it non-atomic. :) -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow 2020-09-30 22:24 ` Jann Horn 2020-09-30 22:49 ` Kees Cook @ 2020-10-01 11:28 ` YiFei Zhu 2020-10-01 21:08 ` Jann Horn 1 sibling, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-10-01 11:28 UTC (permalink / raw) To: Jann Horn Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Wed, Sep 30, 2020 at 5:24 PM Jann Horn <jannh@google.com> wrote: > If you did the architecture enablement for X86 later in the series, > you could move this part over into that patch, that'd be cleaner. As in, patch 1: bitmap check logic. patch 2: emulator. patch 3: enable for x86? > > + * Tis struct is ordered to minimize padding holes. > > I think this comment can probably go away, there isn't really much > trickery around padding holes in the struct as it is now. Oh right, I was trying the locks and adding bits to indicate if certain arches are primed, then I undid that. > > + set_bit(nr, bitmap); > > set_bit() is atomic, but since we only do this at filter setup, before > the filter becomes globally visible, we don't need atomicity here. So > this should probably use __set_bit() instead. Right YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow 2020-10-01 11:28 ` YiFei Zhu @ 2020-10-01 21:08 ` Jann Horn 0 siblings, 0 replies; 149+ messages in thread From: Jann Horn @ 2020-10-01 21:08 UTC (permalink / raw) To: YiFei Zhu Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Thu, Oct 1, 2020 at 1:28 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > On Wed, Sep 30, 2020 at 5:24 PM Jann Horn <jannh@google.com> wrote: > > If you did the architecture enablement for X86 later in the series, > > you could move this part over into that patch, that'd be cleaner. > > As in, patch 1: bitmap check logic. patch 2: emulator. patch 3: enable for x86? Yeah. ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow 2020-09-30 15:19 ` [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow YiFei Zhu 2020-09-30 22:24 ` Jann Horn @ 2020-09-30 22:40 ` Kees Cook 2020-10-01 11:52 ` YiFei Zhu 2020-10-09 4:47 ` YiFei Zhu 2 siblings, 1 reply; 149+ messages in thread From: Kees Cook @ 2020-09-30 22:40 UTC (permalink / raw) To: YiFei Zhu, Jann Horn Cc: containers, YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Wed, Sep 30, 2020 at 10:19:13AM -0500, YiFei Zhu wrote: > From: YiFei Zhu <yifeifz2@illinois.edu> > > SECCOMP_CACHE_NR_ONLY will only operate on syscalls that do not > access any syscall arguments or instruction pointer. To facilitate > this we need a static analyser to know whether a filter will > return allow regardless of syscall arguments for a given > architecture number / syscall number pair. This is implemented > here with a pseudo-emulator, and stored in a per-filter bitmap. > > Each common BPF instruction are emulated. Any weirdness or loading > from a syscall argument will cause the emulator to bail. > > The emulation is also halted if it reaches a return. In that case, > if it returns an SECCOMP_RET_ALLOW, the syscall is marked as good. > > Emulator structure and comments are from Kees [1] and Jann [2]. > > Emulation is done at attach time. If a filter depends on more > filters, and if the dependee does not guarantee to allow the > syscall, then we skip the emulation of this syscall. 
> > [1] https://lore.kernel.org/lkml/20200923232923.3142503-5-keescook@chromium.org/ > [2] https://lore.kernel.org/lkml/CAG48ez1p=dR_2ikKq=xVxkoGg0fYpTBpkhJSv1w-6BG=76PAvw@mail.gmail.com/ > > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> See comments on patch 3 for reorganizing this a bit for the next version. For the infrastructure patch, I'd like to see much of the cover letter in the commit log (otherwise those details are harder for people to find). That will describe the _why_ for preparing this change, etc. For the emulator patch, I'd like to see the discussion about how the subset of BFP instructions was selected, what libraries Jann and I examined, etc. (For all of these commit logs, I try to pretend that whoever is reading it has not followed any lkml thread of discussion, etc.) > --- > arch/Kconfig | 34 ++++++++++ > arch/x86/Kconfig | 1 + > kernel/seccomp.c | 167 ++++++++++++++++++++++++++++++++++++++++++++++- > 3 files changed, 201 insertions(+), 1 deletion(-) > > diff --git a/arch/Kconfig b/arch/Kconfig > index 21a3675a7a3a..ca867b2a5d71 100644 > --- a/arch/Kconfig > +++ b/arch/Kconfig > @@ -471,6 +471,14 @@ config HAVE_ARCH_SECCOMP_FILTER > results in the system call being skipped immediately. > - seccomp syscall wired up > > +config HAVE_ARCH_SECCOMP_CACHE_NR_ONLY > + bool > + help > + An arch should select this symbol if it provides all of these things: > + - all the requirements for HAVE_ARCH_SECCOMP_FILTER > + - SECCOMP_ARCH_DEFAULT > + - SECCOMP_ARCH_DEFAULT_NR > + There's no need for this config and the per-arch Kconfig clutter: SECCOMP_ARCH_NATIVE will be a sufficient gate. > config SECCOMP > prompt "Enable seccomp to safely execute untrusted bytecode" > def_bool y > @@ -498,6 +506,32 @@ config SECCOMP_FILTER > > See Documentation/userspace-api/seccomp_filter.rst for details. 
> > +choice > + prompt "Seccomp filter cache" > + default SECCOMP_CACHE_NONE > + depends on SECCOMP_FILTER > + depends on HAVE_ARCH_SECCOMP_CACHE_NR_ONLY > + help > + Seccomp filters can potentially incur large overhead for each > + system call. This can alleviate some of the overhead. > + > + If in doubt, select 'syscall numbers only'. > + > +config SECCOMP_CACHE_NONE > + bool "None" > + help > + No caching is done. Seccomp filters will be called each time > + a system call occurs in a seccomp-guarded task. > + > +config SECCOMP_CACHE_NR_ONLY > + bool "Syscall number only" > + depends on HAVE_ARCH_SECCOMP_CACHE_NR_ONLY > + help > + For each syscall number, if the seccomp filter has a fixed > + result, store that result in a bitmap to speed up system calls. > + > +endchoice I don't want this config: there is only 1 caching mechanism happening in this series and I do not want to have it buildable as "off": it should be available for all supported architectures. When further caching methods happen, the config can be introduced then (though I'll likely argue it should then be a boot param to allow distro kernels to make it selectable). 
> + > config HAVE_ARCH_STACKLEAK > bool > help > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig > index 1ab22869a765..ff5289228ea5 100644 > --- a/arch/x86/Kconfig > +++ b/arch/x86/Kconfig > @@ -150,6 +150,7 @@ config X86 > select HAVE_ARCH_COMPAT_MMAP_BASES if MMU && COMPAT > select HAVE_ARCH_PREL32_RELOCATIONS > select HAVE_ARCH_SECCOMP_FILTER > + select HAVE_ARCH_SECCOMP_CACHE_NR_ONLY > select HAVE_ARCH_THREAD_STRUCT_WHITELIST > select HAVE_ARCH_STACKLEAK > select HAVE_ARCH_TRACEHOOK > diff --git a/kernel/seccomp.c b/kernel/seccomp.c > index ae6b40cc39f4..f09c9e74ae05 100644 > --- a/kernel/seccomp.c > +++ b/kernel/seccomp.c > @@ -143,6 +143,37 @@ struct notification { > struct list_head notifications; > }; > > +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY > +/** > + * struct seccomp_cache_filter_data - container for cache's per-filter data naming nits: "data" doesn't tell me anything. "seccomp_action_cache" might be better. Or since it's an internal struct, maybe just "action_cache". And let's not use the word "container" for the kerndoc. ;) How about "per-filter cache of seccomp actions per arch/syscall pair" > + * > + * Tis struct is ordered to minimize padding holes. typo: This > + * > + * @syscall_allow_default: A bitmap where each bit represents whether the > + * filter willalways allow the syscall, for the typo: missing space > + * default architecture. default -> native > + * @syscall_allow_compat: A bitmap where each bit represents whether the > + * filter will always allow the syscall, for the > + * compat architecture. > + */ > +struct seccomp_cache_filter_data { > +#ifdef SECCOMP_ARCH_DEFAULT > + DECLARE_BITMAP(syscall_allow_default, SECCOMP_ARCH_DEFAULT_NR); naming nit: "syscall" is redundant here, IMO. "allow_native" should be fine. 
> +#endif > +#ifdef SECCOMP_ARCH_COMPAT > + DECLARE_BITMAP(syscall_allow_compat, SECCOMP_ARCH_COMPAT_NR); > +#endif > +}; > + > +#define SECCOMP_EMU_MAX_PENDING_STATES 64 > +#else > +struct seccomp_cache_filter_data { }; > + > +static inline void seccomp_cache_prepare(struct seccomp_filter *sfilter) > +{ > +} > +#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */ > + > /** > * struct seccomp_filter - container for seccomp BPF programs > * > @@ -159,6 +190,7 @@ struct notification { > * this filter after reaching 0. The @users count is always smaller > * or equal to @refs. Hence, reaching 0 for @users does not mean > * the filter can be freed. > + * @cache: container for cache-related data. more descriptive: "cache of arch/syscall mappings to actions" > * @log: true if all actions except for SECCOMP_RET_ALLOW should be logged > * @prev: points to a previously installed, or inherited, filter > * @prog: the BPF program to evaluate > @@ -180,6 +212,7 @@ struct seccomp_filter { > refcount_t refs; > refcount_t users; > bool log; > + struct seccomp_cache_filter_data cache; > struct seccomp_filter *prev; > struct bpf_prog *prog; > struct notification *notif; > @@ -544,7 +577,8 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog) > { > struct seccomp_filter *sfilter; > int ret; > - const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE); > + const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE) || > + IS_ENABLED(CONFIG_SECCOMP_CACHE_NR_ONLY); > > if (fprog->len == 0 || fprog->len > BPF_MAXINSNS) > return ERR_PTR(-EINVAL); > @@ -610,6 +644,136 @@ seccomp_prepare_user_filter(const char __user *user_filter) > return filter; > } > > +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY > +/** > + * seccomp_emu_is_const_allow - check if filter is constant allow with given data > + * @fprog: The BPF programs > + * @sd: The seccomp data to check against, only syscall number are arch > + * number are considered constant. 
> + */ > +static bool seccomp_emu_is_const_allow(struct sock_fprog_kern *fprog, > + struct seccomp_data *sd) naming: I would drop "emu" from here. The caller doesn't care how it is determined. ;) > +{ > + unsigned int insns; > + unsigned int reg_value = 0; > + unsigned int pc; > + bool op_res; > + > + if (WARN_ON_ONCE(!fprog)) > + return false; > + > + insns = bpf_classic_proglen(fprog); > + for (pc = 0; pc < insns; pc++) { > + struct sock_filter *insn = &fprog->filter[pc]; > + u16 code = insn->code; > + u32 k = insn->k; > + > + switch (code) { > + case BPF_LD | BPF_W | BPF_ABS: > + switch (k) { > + case offsetof(struct seccomp_data, nr): > + reg_value = sd->nr; > + break; > + case offsetof(struct seccomp_data, arch): > + reg_value = sd->arch; > + break; > + default: > + /* can't optimize (non-constant value load) */ > + return false; > + } > + break; > + case BPF_RET | BPF_K: > + /* reached return with constant values only, check allow */ > + return k == SECCOMP_RET_ALLOW; > + case BPF_JMP | BPF_JA: > + pc += insn->k; > + break; > + case BPF_JMP | BPF_JEQ | BPF_K: > + case BPF_JMP | BPF_JGE | BPF_K: > + case BPF_JMP | BPF_JGT | BPF_K: > + case BPF_JMP | BPF_JSET | BPF_K: > + switch (BPF_OP(code)) { > + case BPF_JEQ: > + op_res = reg_value == k; > + break; > + case BPF_JGE: > + op_res = reg_value >= k; > + break; > + case BPF_JGT: > + op_res = reg_value > k; > + break; > + case BPF_JSET: > + op_res = !!(reg_value & k); > + break; > + default: > + /* can't optimize (unknown jump) */ > + return false; > + } > + > + pc += op_res ? insn->jt : insn->jf; > + break; > + case BPF_ALU | BPF_AND | BPF_K: > + reg_value &= k; > + break; > + default: > + /* can't optimize (unknown insn) */ > + return false; > + } > + } > + > + /* ran off the end of the filter?! 
*/ > + WARN_ON(1); > + return false; > +} For the emulator patch, you'll want to include these tags in the commit log: Suggested-by: Jann Horn <jannh@google.com> Co-developed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Kees Cook <keescook@chromium.org> > + > +static void seccomp_cache_prepare_bitmap(struct seccomp_filter *sfilter, > + void *bitmap, const void *bitmap_prev, > + size_t bitmap_size, int arch) > +{ > + struct sock_fprog_kern *fprog = sfilter->prog->orig_prog; > + struct seccomp_data sd; > + int nr; > + > + for (nr = 0; nr < bitmap_size; nr++) { > + if (bitmap_prev && !test_bit(nr, bitmap_prev)) > + continue; > + > + sd.nr = nr; > + sd.arch = arch; > + > + if (seccomp_emu_is_const_allow(fprog, &sd)) > + set_bit(nr, bitmap); The guiding principle with seccomp's designs is to always make things _more_ restrictive, never less. While we can never escape the consequences of having seccomp_is_const_allow() report the wrong answer, we can at least follow the basic principles, hopefully minimizing the impact. When the bitmap starts with "always allowed" and we only flip it towards "run full filters", we're only ever making things more restrictive. If we instead go from "run full filters" towards "always allowed", we run the risk of making things less restrictive. For example: a process that maliciously adds a filter that the emulator mistakenly evaluates to "always allow" doesn't suddenly cause all the prior filters to stop running. (i.e. this isolates the flaw outcome, and doesn't depend on the early "do not emulate if we already know we have to run filters" case before the emulation call: there is no code path that allows the cache to weaken: it can only maintain it being wrong). Without any seccomp filter installed, all syscalls are "always allowed" (from the perspective of the seccomp boundary), so the default of the cache needs to be "always allowed". if (bitmap_prev) { /* The new filter must be as restrictive as the last. 
*/ bitmap_copy(bitmap, bitmap_prev, bitmap_size); } else { /* Before any filters, all syscalls are always allowed. */ bitmap_fill(bitmap, bitmap_size); } for (nr = 0; nr < bitmap_size; nr++) { /* No bitmap change: not a cacheable action. */ if (!test_bit(nr, bitmap)) continue; /* No bitmap change: continue to always allow. */ if (seccomp_is_const_allow(fprog, &sd)) continue; /* Not a cacheable action: always run filters. */ clear_bit(nr, bitmap); > + } > +} > + > +/** > + * seccomp_cache_prepare - emulate the filter to find cacheable syscalls > + * @sfilter: The seccomp filter > + * > + * The syscall bitmaps in @sfilter->cache are filled in place. > + */ > +static void seccomp_cache_prepare(struct seccomp_filter *sfilter) > +{ > + struct seccomp_cache_filter_data *cache = &sfilter->cache; > + const struct seccomp_cache_filter_data *cache_prev = > + sfilter->prev ? &sfilter->prev->cache : NULL; > + > +#ifdef SECCOMP_ARCH_DEFAULT > + seccomp_cache_prepare_bitmap(sfilter, cache->syscall_allow_default, > + cache_prev ? cache_prev->syscall_allow_default : NULL, > + SECCOMP_ARCH_DEFAULT_NR, > + SECCOMP_ARCH_DEFAULT); > +#endif /* SECCOMP_ARCH_DEFAULT */ > + > +#ifdef SECCOMP_ARCH_COMPAT > + seccomp_cache_prepare_bitmap(sfilter, cache->syscall_allow_compat, > + cache_prev ? cache_prev->syscall_allow_compat : NULL, > + SECCOMP_ARCH_COMPAT_NR, > + SECCOMP_ARCH_COMPAT); > +#endif /* SECCOMP_ARCH_COMPAT */ > +} > +#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */ > + > /** > * seccomp_attach_filter: validate and attach filter > * @flags: flags to change filter behavior > @@ -659,6 +823,7 @@ static long seccomp_attach_filter(unsigned int flags, > * task reference. > */ > filter->prev = current->seccomp.filter; > + seccomp_cache_prepare(filter); > current->seccomp.filter = filter; Jann, do we need a WRITE_ONCE() or something when writing current->seccomp.filter here?
I think the rmb() in __seccomp_filter() will cover the cache bitmap writes having finished before the filter pointer is followed in the TSYNC case. > atomic_inc(&current->seccomp.filter_count); > > -- > 2.28.0 > Otherwise, yes, I'm looking forward to having this for everyone to use! :) -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
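The exchange above centers on the emulator from patch 2/5: constant-propagate only loads of the syscall number and architecture, and bail out conservatively on anything else. That walk is small enough to exercise in user space. The following is a minimal sketch, not the kernel code: the opcode macros and struct layouts are redefined locally so it builds without kernel headers, and both demo filters are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Local copies of the classic-BPF opcode encoding (cf. linux/bpf_common.h). */
#define BPF_LD   0x00
#define BPF_ALU  0x04
#define BPF_JMP  0x05
#define BPF_RET  0x06
#define BPF_W    0x00
#define BPF_ABS  0x20
#define BPF_K    0x00
#define BPF_JA   0x00
#define BPF_JEQ  0x10
#define BPF_JGT  0x20
#define BPF_JGE  0x30
#define BPF_JSET 0x40
#define BPF_AND  0x50
#define BPF_OP(code) ((code) & 0xf0)

#define SECCOMP_RET_ALLOW 0x7fff0000U
#define SECCOMP_RET_KILL  0x00000000U

struct sock_filter { uint16_t code; uint8_t jt, jf; uint32_t k; };
struct seccomp_data {
	int32_t nr;
	uint32_t arch;
	uint64_t instruction_pointer;
	uint64_t args[6];
};

/* Same walk as the patch's emulator: true only when the filter provably
 * returns SECCOMP_RET_ALLOW using nothing but the nr/arch constants. */
static bool is_const_allow(const struct sock_filter *filter, size_t len,
			   const struct seccomp_data *sd)
{
	uint32_t reg_value = 0;
	size_t pc;

	for (pc = 0; pc < len; pc++) {
		uint16_t code = filter[pc].code;
		uint32_t k = filter[pc].k;
		bool op_res;

		switch (code) {
		case BPF_LD | BPF_W | BPF_ABS:
			if (k == offsetof(struct seccomp_data, nr))
				reg_value = (uint32_t)sd->nr;
			else if (k == offsetof(struct seccomp_data, arch))
				reg_value = sd->arch;
			else
				return false;	/* reads an argument or IP */
			break;
		case BPF_RET | BPF_K:
			/* reached a return with constant values only */
			return k == SECCOMP_RET_ALLOW;
		case BPF_JMP | BPF_JA:
			pc += k;
			break;
		case BPF_JMP | BPF_JEQ | BPF_K:
		case BPF_JMP | BPF_JGE | BPF_K:
		case BPF_JMP | BPF_JGT | BPF_K:
		case BPF_JMP | BPF_JSET | BPF_K:
			switch (BPF_OP(code)) {
			case BPF_JEQ: op_res = reg_value == k; break;
			case BPF_JGE: op_res = reg_value >= k; break;
			case BPF_JGT: op_res = reg_value > k;  break;
			default:      op_res = (reg_value & k) != 0; break;
			}
			pc += op_res ? filter[pc].jt : filter[pc].jf;
			break;
		case BPF_ALU | BPF_AND | BPF_K:
			reg_value &= k;
			break;
		default:
			return false;	/* unknown insn: stay conservative */
		}
	}
	return false;	/* ran off the end: never treat as cacheable */
}

/* Made-up demo: allow syscalls 0 and 1, kill the rest (arg-independent). */
static const struct sock_filter nr_only_demo[] = {
	{ BPF_LD | BPF_W | BPF_ABS, 0, 0, offsetof(struct seccomp_data, nr) },
	{ BPF_JMP | BPF_JGT | BPF_K, 1, 0, 1 },
	{ BPF_RET | BPF_K, 0, 0, SECCOMP_RET_ALLOW },
	{ BPF_RET | BPF_K, 0, 0, SECCOMP_RET_KILL },
};

/* Made-up demo: inspects args[0], so it can never be a constant allow. */
static const struct sock_filter arg_demo[] = {
	{ BPF_LD | BPF_W | BPF_ABS, 0, 0, offsetof(struct seccomp_data, args[0]) },
	{ BPF_RET | BPF_K, 0, 0, SECCOMP_RET_ALLOW },
};
```

Note how every unhandled opcode falls to `return false`: as with the kernel emulator, a wrong answer can only cause the full filters to keep running, never to be skipped.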
* Re: [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow 2020-09-30 22:40 ` Kees Cook @ 2020-10-01 11:52 ` YiFei Zhu 2020-10-01 21:05 ` Kees Cook 0 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-10-01 11:52 UTC (permalink / raw) To: Kees Cook Cc: Jann Horn, Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Wed, Sep 30, 2020 at 5:40 PM Kees Cook <keescook@chromium.org> wrote: > I don't want this config: there is only 1 caching mechanism happening > in this series and I do not want to have it buildable as "off": it > should be available for all supported architectures. When further caching > methods happen, the config can be introduced then (though I'll likely > argue it should then be a boot param to allow distro kernels to make it > selectable). Alright, we can think about configuration (or boot param) when more methods happen then. > The guiding principle with seccomp's designs is to always make things > _more_ restrictive, never less. While we can never escape the > consequences of having seccomp_is_const_allow() report the wrong > answer, we can at least follow the basic principles, hopefully > minimizing the impact. > > When the bitmap starts with "always allowed" and we only flip it towards > "run full filters", we're only ever making things more restrictive. If > we instead go from "run full filters" towards "always allowed", we run > the risk of making things less restrictive. For example: a process that > maliciously adds a filter that the emulator mistakenly evaluates to > "always allow" doesn't suddenly cause all the prior filters to stop running. > (i.e. 
this isolates the flaw outcome, and doesn't depend on the early > "do not emulate if we already know we have to run filters" case before > the emulation call: there is no code path that allows the cache to > weaken: it can only maintain it being wrong). > > Without any seccomp filter installed, all syscalls are "always allowed" > (from the perspective of the seccomp boundary), so the default of the > cache needs to be "always allowed". I cannot follow this. If a 'process that maliciously adds a filter that the emulator mistakenly evaluates to "always allow" doesn't suddenly cause all the prior filters to stop running', hence, you want, by default, the cache to be as transparent as possible. You would lift the restriction if and only if you are absolutely sure it does not cause an impact. In this patch, if there are prior filters, it goes through this logic: if (bitmap_prev && !test_bit(nr, bitmap_prev)) continue; Hence, if the malicious filter were to happen, and prior filters were supposed to run, then seccomp_is_const_allow is simply not invoked -- what it returns cannot be used maliciously by an adversary. > > if (bitmap_prev) { > /* The new filter must be as restrictive as the last. */ > bitmap_copy(bitmap, bitmap_prev, bitmap_size); > } else { > /* Before any filters, all syscalls are always allowed. */ > bitmap_fill(bitmap, bitmap_size); > } > > for (nr = 0; nr < bitmap_size; nr++) { > /* No bitmap change: not a cacheable action. */ > if (!test_bit(nr, bitmap)) > continue; > > /* No bitmap change: continue to always allow. */ > if (seccomp_is_const_allow(fprog, &sd)) > continue; > > /* Not a cacheable action: always run filters. */ > clear_bit(nr, bitmap); I'm not strongly against this logic. I just feel unconvinced that this is any different with a slightly increased complexity. YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
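The v3 shape YiFei defends here — skip the emulator entirely when the previous cache already forces the full filters — can be sketched with a toy boolean bitmap. Everything below is illustrative: the kernel uses real bitmaps with set_bit()/test_bit(), and the constant-allow predicate stands in for seccomp_is_const_allow().

```c
#include <assert.h>
#include <stdbool.h>

#define NR_SYSCALLS 64

/* Stand-in for seccomp_is_const_allow(): pretend the new filter is a
 * provable constant allow for syscalls below 32 only. */
static bool new_filter_const_allows(int nr)
{
	return nr < 32;
}

/* v3 shape: start all-clear; a bit is set only when the previous cache
 * allowed the syscall AND the new filter is a constant allow. When the
 * previous cache already forces the full filters, the emulator result is
 * never even consulted for that syscall. */
static void cache_prepare_v3(bool bitmap[NR_SYSCALLS], const bool *bitmap_prev)
{
	for (int nr = 0; nr < NR_SYSCALLS; nr++) {
		bitmap[nr] = false;
		if (bitmap_prev && !bitmap_prev[nr])
			continue;
		if (new_filter_const_allows(nr))
			bitmap[nr] = true;
	}
}
```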
* Re: [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow 2020-10-01 11:52 ` YiFei Zhu @ 2020-10-01 21:05 ` Kees Cook 2020-10-02 11:08 ` YiFei Zhu 0 siblings, 1 reply; 149+ messages in thread From: Kees Cook @ 2020-10-01 21:05 UTC (permalink / raw) To: YiFei Zhu Cc: Jann Horn, Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Thu, Oct 01, 2020 at 06:52:50AM -0500, YiFei Zhu wrote: > On Wed, Sep 30, 2020 at 5:40 PM Kees Cook <keescook@chromium.org> wrote: > > The guiding principle with seccomp's designs is to always make things > > _more_ restrictive, never less. While we can never escape the > > consequences of having seccomp_is_const_allow() report the wrong > > answer, we can at least follow the basic principles, hopefully > > minimizing the impact. > > > > When the bitmap starts with "always allowed" and we only flip it towards > > "run full filters", we're only ever making things more restrictive. If > > we instead go from "run full filters" towards "always allowed", we run > > the risk of making things less restrictive. For example: a process that > > maliciously adds a filter that the emulator mistakenly evaluates to > > "always allow" doesn't suddenly cause all the prior filters to stop running. > > (i.e. this isolates the flaw outcome, and doesn't depend on the early > > "do not emulate if we already know we have to run filters" case before > > the emulation call: there is no code path that allows the cache to > > weaken: it can only maintain it being wrong). > > > > Without any seccomp filter installed, all syscalls are "always allowed" > > (from the perspective of the seccomp boundary), so the default of the > > cache needs to be "always allowed". > > I cannot follow this. 
If a 'process that maliciously adds a filter > that the emulator mistakenly evaluates to "always allow" doesn't > suddenly cause all the prior filters to stop running', hence, you > want, by default, the cache to be as transparent as possible. You > would lift the restriction if and only if you are absolutely sure it > does not cause an impact. Yes, right now, the v3 code pattern is entirely safe. > > In this patch, if there are prior filters, it goes through this logic: > > if (bitmap_prev && !test_bit(nr, bitmap_prev)) > continue; > > Hence, if the malicious filter were to happen, and prior filters were > supposed to run, then seccomp_is_const_allow is simply not invoked -- > what it returns cannot be used maliciously by an adversary. Right, but we depend on that test always doing the correct thing (and continuing to do so into the future). I'm looking at this from the perspective of future changes, maintenance, etc. I want the actions to match the design principles as closely as possible so that future evolutions of the code have lower risk to bugs causing security failures. Right now, the code is simple. I want to design this so that when it is complex, it will still fail toward safety in the face of bugs. > > if (bitmap_prev) { > > /* The new filter must be as restrictive as the last. */ > > bitmap_copy(bitmap, bitmap_prev, bitmap_size); > > } else { > > /* Before any filters, all syscalls are always allowed. */ > > bitmap_fill(bitmap, bitmap_size); > > } > > > > for (nr = 0; nr < bitmap_size; nr++) { > > /* No bitmap change: not a cacheable action. */ > > if (!test_bit(nr, bitmap)) > > continue; > > > > /* No bitmap change: continue to always allow. */ > > if (seccomp_is_const_allow(fprog, &sd)) > > continue; > > > > /* Not a cacheable action: always run filters. */ > > clear_bit(nr, bitmap); > > I'm not strongly against this logic. I just feel unconvinced that this > is any different with a slightly increased complexity.
I'd prefer this way because then the loop, the tests, and the results only make the bitmap more restrictive. The worst thing a bug in here can do is leave the bitmap unchanged (which is certainly bad), but it can't _undo_ an earlier restriction. The proposed loop's leading test_bit() becomes only an optimization, rather than being required for policy enforcement. In other words, I prefer: inherit all prior bitmap restrictions for all syscalls if this filter not restricted continue set bitmap restricted within this loop (where the bulk of future logic may get added), the worst-case future bug-induced failure mode for the syscall bitmap is "skip *this* filter". Instead of: set bitmap all restricted for all syscalls if previous bitmap not restricted and filter not restricted set bitmap unrestricted within this loop the worst-case future bug-induced failure mode for the syscall bitmap is "skip *all* filters". Or, to reword again, this: retain restrictions from previous caching decisions for all syscalls [evaluate this filter, maybe continue] set restricted instead of: set new cache to all restricted for all syscalls [evaluate prior cache and this filter, maybe continue] set unrestricted I expect the future code changes for caching to be in the "evaluate" step, so I'd like the code designed to make things MORE restrictive not less from the start, and remove any prior cache state tests from the loop. At the end of the day I believe changing the design like this now lays the groundwork for the caching mechanism being more robust against having future bugs introduce security flaws. -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
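Kees's restriction-only shape can be sketched the same way as the v3 one: inherit the previous bitmap (or "all allowed" before any filter), then only ever clear bits. As before, the boolean bitmap and the constant-allow predicate are illustrative stand-ins, not kernel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NR_SYSCALLS 64

/* Stand-in for seccomp_is_const_allow(): pretend this filter is a
 * provable constant allow for even syscall numbers only. */
static bool this_filter_const_allows(int nr)
{
	return (nr % 2) == 0;
}

/* Restriction-only shape: a bug in the loop can at worst leave a bit set
 * that was already set -- it can never undo a restriction recorded by an
 * earlier filter, so failures fall toward "run the full filters". */
static void cache_prepare_restrict(bool bitmap[NR_SYSCALLS],
				   const bool *bitmap_prev)
{
	if (bitmap_prev) {
		/* The new filter must be as restrictive as the last. */
		memcpy(bitmap, bitmap_prev, NR_SYSCALLS * sizeof(bool));
	} else {
		/* Before any filters, all syscalls are always allowed. */
		for (int nr = 0; nr < NR_SYSCALLS; nr++)
			bitmap[nr] = true;
	}

	for (int nr = 0; nr < NR_SYSCALLS; nr++) {
		/* Already restricted: this test is only an optimization. */
		if (!bitmap[nr])
			continue;
		/* Still a constant allow under this filter: keep the bit. */
		if (this_filter_const_allows(nr))
			continue;
		/* Otherwise fail toward safety: always run full filters. */
		bitmap[nr] = false;
	}
}
```

When the emulator stand-in is bug-free, this computes the same bitmap as the v3 shape; the difference Kees argues for is only in which direction a future bug can push the result.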
* Re: [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow 2020-10-01 21:05 ` Kees Cook @ 2020-10-02 11:08 ` YiFei Zhu 0 siblings, 0 replies; 149+ messages in thread From: YiFei Zhu @ 2020-10-02 11:08 UTC (permalink / raw) To: Kees Cook Cc: Jann Horn, Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Thu, Oct 1, 2020 at 4:05 PM Kees Cook <keescook@chromium.org> wrote: > Right, but we depend on that test always doing the correct thing (and > continuing to do so into the future). I'm looking at this from the > perspective of future changes, maintenance, etc. I want the actions to > match the design principles as closely as possible so that future > evolutions of the code have lower risk to bugs causing security > failures. Right now, the code is simple. I want to design this so that > when it is complex, it will still fail toward safety in the face of > bugs. > > I'd prefer this way because for the loop, the tests, and the results only > make the bitmap more restrictive. The worst thing a bug in here can do is > leave the bitmap unchanged (which is certainly bad), but it can't _undo_ > an earlier restriction. > > The proposed loop's leading test_bit() becomes only an optimization, > rather than being required for policy enforcement. > > In other words, I prefer: > > inherit all prior prior bitmap restrictions > for all syscalls > if this filter not restricted > continue > set bitmap restricted > > within this loop (where the bulk of future logic may get added), > the worse-case future bug-induced failure mode for the syscall > bitmap is "skip *this* filter". 
> > > Instead of: > > set bitmap all restricted > for all syscalls > if previous bitmap not restricted and > filter not restricted > set bitmap unrestricted > > within this loop the worst-case future bug-induced failure mode > for the syscall bitmap is "skip *all* filters". > > > > > Or, to reword again, this: > > retain restrictions from previous caching decisions > for all syscalls > [evaluate this filter, maybe continue] > set restricted > > instead of: > > set new cache to all restricted > for all syscalls > [evaluate prior cache and this filter, maybe continue] > set unrestricted > > I expect the future code changes for caching to be in the "evaluate" > step, so I'd like the code designed to make things MORE restrictive not > less from the start, and remove any prior cache state tests from the > loop. > > At the end of the day I believe changing the design like this now lays > the groundwork to the caching mechanism being more robust against having > future bugs introduce security flaws. > I see. Makes sense. Thanks. Will do that in the v4 YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow 2020-09-30 15:19 ` [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow YiFei Zhu 2020-09-30 22:24 ` Jann Horn 2020-09-30 22:40 ` Kees Cook @ 2020-10-09 4:47 ` YiFei Zhu 2020-10-09 5:41 ` Kees Cook 2 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-10-09 4:47 UTC (permalink / raw) To: Linux Containers Cc: YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Wed, Sep 30, 2020 at 10:20 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > @@ -544,7 +577,8 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog) > { > struct seccomp_filter *sfilter; > int ret; > - const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE); > + const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE) || > + IS_ENABLED(CONFIG_SECCOMP_CACHE_NR_ONLY); > > if (fprog->len == 0 || fprog->len > BPF_MAXINSNS) > return ERR_PTR(-EINVAL); I'm trying to use __is_defined(SECCOMP_ARCH_NATIVE) here, and got this message: kernel/seccomp.c: In function ‘seccomp_prepare_filter’: ././include/linux/kconfig.h:44:44: error: pasting "__ARG_PLACEHOLDER_" and "(" does not give a valid preprocessing token 44 | #define ___is_defined(val) ____is_defined(__ARG_PLACEHOLDER_##val) | ^~~~~~~~~~~~~~~~~~ ././include/linux/kconfig.h:43:27: note: in expansion of macro ‘___is_defined’ 43 | #define __is_defined(x) ___is_defined(x) | ^~~~~~~~~~~~~ kernel/seccomp.c:629:11: note: in expansion of macro ‘__is_defined’ 629 | __is_defined(SECCOMP_ARCH_NATIVE); | ^~~~~~~~~~~~ Looking at the implementation of __is_defined, it is: #define __ARG_PLACEHOLDER_1 0, #define __take_second_arg(__ignored, val, ...) 
val #define __is_defined(x) ___is_defined(x) #define ___is_defined(val) ____is_defined(__ARG_PLACEHOLDER_##val) #define ____is_defined(arg1_or_junk) __take_second_arg(arg1_or_junk 1, 0) Hence, when FOO is defined to be 1, then the expansion would be __is_defined(FOO) -> ___is_defined(1) -> ____is_defined(__ARG_PLACEHOLDER_1) -> __take_second_arg(0, 1, 0) -> 1, and when FOO is not defined, the expansion would be __is_defined(FOO) -> ___is_defined(FOO) -> ____is_defined(__ARG_PLACEHOLDER_FOO) -> __take_second_arg(__ARG_PLACEHOLDER_FOO 1, 0) -> 0 However, here SECCOMP_ARCH_NATIVE is an expression from an OR of some bits, and __is_defined(SECCOMP_ARCH_NATIVE) would not expand to __ARG_PLACEHOLDER_1 during any stage in the preprocessing. Is there any better way to do this? I'm thinking of just doing #if defined(CONFIG_CHECKPOINT_RESTORE) || defined(SECCOMP_ARCH_NATIVE) like in Kees's patch. YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
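The expansion chain YiFei traces can be reproduced outside the kernel by copying the kconfig.h helper shapes. The working cases are checkable at run time; the failing case is a preprocessor error by construction, so it can only be noted in a comment.

```c
#include <assert.h>

/* Copied shape of the helpers from include/linux/kconfig.h. */
#define __ARG_PLACEHOLDER_1 0,
#define __take_second_arg(__ignored, val, ...) val
#define __is_defined(x) ___is_defined(x)
#define ___is_defined(val) ____is_defined(__ARG_PLACEHOLDER_##val)
#define ____is_defined(arg1_or_junk) __take_second_arg(arg1_or_junk 1, 0)

#define FOO 1	/* defined to the single token 1 */
/* BAR deliberately left undefined */

static int foo_is_defined(void) { return __is_defined(FOO); }
static int bar_is_defined(void) { return __is_defined(BAR); }

/* A macro defined to an expression, like SECCOMP_ARCH_NATIVE's
 * "(AUDIT_ARCH_X86_64 | ...)", cannot go through __is_defined():
 * pasting __ARG_PLACEHOLDER_ onto "(" is exactly the
 * "does not give a valid preprocessing token" error quoted above,
 * so the equivalent here must stay commented out:
 *
 *   #define BAZ (0x40000000 | 0x3e)
 *   ... __is_defined(BAZ) would not compile ...
 */
```

The double indirection matters: in `__is_defined(x)`, `x` is fully macro-expanded before reaching the `##` paste, which is why a macro defined to `1` turns into `__ARG_PLACEHOLDER_1` while an undefined name survives as junk tokens that select the trailing `0`.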
* Re: [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow 2020-10-09 4:47 ` YiFei Zhu @ 2020-10-09 5:41 ` Kees Cook 0 siblings, 0 replies; 149+ messages in thread From: Kees Cook @ 2020-10-09 5:41 UTC (permalink / raw) To: YiFei Zhu Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Thu, Oct 08, 2020 at 11:47:17PM -0500, YiFei Zhu wrote: > On Wed, Sep 30, 2020 at 10:20 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > > @@ -544,7 +577,8 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog) > > { > > struct seccomp_filter *sfilter; > > int ret; > > - const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE); > > + const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE) || > > + IS_ENABLED(CONFIG_SECCOMP_CACHE_NR_ONLY); > > > > if (fprog->len == 0 || fprog->len > BPF_MAXINSNS) > > return ERR_PTR(-EINVAL); > > I'm trying to use __is_defined(SECCOMP_ARCH_NATIVE) here, and got this message: > > kernel/seccomp.c: In function ‘seccomp_prepare_filter’: > ././include/linux/kconfig.h:44:44: error: pasting "__ARG_PLACEHOLDER_" > and "(" does not give a valid preprocessing token > 44 | #define ___is_defined(val) ____is_defined(__ARG_PLACEHOLDER_##val) > | ^~~~~~~~~~~~~~~~~~ > ././include/linux/kconfig.h:43:27: note: in expansion of macro ‘___is_defined’ > 43 | #define __is_defined(x) ___is_defined(x) > | ^~~~~~~~~~~~~ > kernel/seccomp.c:629:11: note: in expansion of macro ‘__is_defined’ > 629 | __is_defined(SECCOMP_ARCH_NATIVE); > | ^~~~~~~~~~~~ > > Looking at the implementation of __is_defined, it is: > > #define __ARG_PLACEHOLDER_1 0, > #define __take_second_arg(__ignored, val, ...) 
val > #define __is_defined(x) ___is_defined(x) > #define ___is_defined(val) ____is_defined(__ARG_PLACEHOLDER_##val) > #define ____is_defined(arg1_or_junk) __take_second_arg(arg1_or_junk 1, 0) > > Hence, when FOO is defined to be 1, then the expansion would be > __is_defined(FOO) -> ___is_defined(1) -> > ____is_defined(__ARG_PLACEHOLDER_1) -> __take_second_arg(0, 1, 0) -> > 1, > and when FOO is not defined, the expansion would be __is_defined(FOO) > -> ___is_defined(FOO) -> ____is_defined(__ARG_PLACEHOLDER_FOO) -> > __take_second_arg(__ARG_PLACEHOLDER_FOO 1, 0) -> 0 > > However, here SECCOMP_ARCH_NATIVE is an expression from an OR of some > bits, and __is_defined(SECCOMP_ARCH_NATIVE) would not expand to > __ARG_PLACEHOLDER_1 during any stage in the preprocessing. > > Is there any better way to do this? I'm thinking of just doing #if > defined(CONFIG_CHECKPOINT_RESTORE) || defined(SECCOMP_ARCH_NATIVE) > like in Kee's patch. Yeah, I think that's simplest. -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
* [PATCH v3 seccomp 3/5] seccomp/cache: Lookup syscall allowlist for fast path 2020-09-30 15:19 ` [PATCH v3 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu 2020-09-30 15:19 ` [PATCH v3 seccomp 1/5] x86: Enable seccomp architecture tracking YiFei Zhu 2020-09-30 15:19 ` [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow YiFei Zhu @ 2020-09-30 15:19 ` YiFei Zhu 2020-09-30 21:32 ` Kees Cook 2020-09-30 15:19 ` [PATCH v3 seccomp 4/5] selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu ` (2 subsequent siblings) 5 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-09-30 15:19 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: YiFei Zhu <yifeifz2@illinois.edu> The fast (common) path for seccomp should be that the filter permits the syscall to pass through, and failing seccomp is expected to be an exceptional case; it is not expected for userspace to call a denylisted syscall over and over. This first finds the current allow bitmask by iterating through syscall_arches[] array and comparing it to the one in struct seccomp_data; this loop is expected to be unrolled. It then does a test_bit against the bitmask. If the bit is set, then there is no need to run the full filter; it returns SECCOMP_RET_ALLOW immediately. 
Co-developed-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu> Signed-off-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu> Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> --- kernel/seccomp.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 52 insertions(+) diff --git a/kernel/seccomp.c b/kernel/seccomp.c index f09c9e74ae05..bed3b2a7f6c8 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -172,6 +172,12 @@ struct seccomp_cache_filter_data { }; static inline void seccomp_cache_prepare(struct seccomp_filter *sfilter) { } + +static inline bool seccomp_cache_check(const struct seccomp_filter *sfilter, + const struct seccomp_data *sd) +{ + return false; +} #endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */ /** @@ -331,6 +337,49 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen) return 0; } +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY +static bool seccomp_cache_check_bitmap(const void *bitmap, size_t bitmap_size, + int syscall_nr) +{ + if (unlikely(syscall_nr < 0 || syscall_nr >= bitmap_size)) + return false; + syscall_nr = array_index_nospec(syscall_nr, bitmap_size); + + return test_bit(syscall_nr, bitmap); +} + +/** + * seccomp_cache_check - lookup seccomp cache + * @sfilter: The seccomp filter + * @sd: The seccomp data to lookup the cache with + * + * Returns true if the seccomp_data is cached and allowed. 
+ */ +static bool seccomp_cache_check(const struct seccomp_filter *sfilter, + const struct seccomp_data *sd) +{ + int syscall_nr = sd->nr; + const struct seccomp_cache_filter_data *cache = &sfilter->cache; + +#ifdef SECCOMP_ARCH_DEFAULT + if (likely(sd->arch == SECCOMP_ARCH_DEFAULT)) + return seccomp_cache_check_bitmap(cache->syscall_allow_default, + SECCOMP_ARCH_DEFAULT_NR, + syscall_nr); +#endif /* SECCOMP_ARCH_DEFAULT */ + +#ifdef SECCOMP_ARCH_COMPAT + if (likely(sd->arch == SECCOMP_ARCH_COMPAT)) + return seccomp_cache_check_bitmap(cache->syscall_allow_compat, + SECCOMP_ARCH_COMPAT_NR, + syscall_nr); +#endif /* SECCOMP_ARCH_COMPAT */ + + WARN_ON_ONCE(true); + return false; +} +#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */ + /** * seccomp_run_filters - evaluates all seccomp filters against @sd * @sd: optional seccomp data to be passed to filters @@ -353,6 +402,9 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd, if (WARN_ON(f == NULL)) return SECCOMP_RET_KILL_PROCESS; + if (seccomp_cache_check(f, sd)) + return SECCOMP_RET_ALLOW; + /* * All filters in the list are evaluated and the lowest BPF return * value always takes priority (ignoring the DATA). -- 2.28.0 ^ permalink raw reply related [flat|nested] 149+ messages in thread
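The fast path this patch adds is just a bounds check followed by a bit test. A user-space approximation looks like the following; helper names are invented, and the sketch omits the array_index_nospec() Spectre clamp, which has no portable user-space equivalent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define BITS_PER_LONG (8 * sizeof(unsigned long))
#define BITMAP_WORDS(bits) (((bits) + BITS_PER_LONG - 1) / BITS_PER_LONG)

/* User-space test_bit(); the kernel version is the same idea. */
static bool test_bit_ul(size_t nr, const unsigned long *map)
{
	return (map[nr / BITS_PER_LONG] >> (nr % BITS_PER_LONG)) & 1UL;
}

/* Mirror of seccomp_cache_check_bitmap(): out-of-range syscall numbers
 * fall back to "not cached", i.e. the full filters still run. The kernel
 * additionally clamps the index with array_index_nospec() before the
 * bitmap access to keep speculation in bounds. */
static bool cache_check_bitmap(const unsigned long *bitmap,
			       size_t bitmap_size, int syscall_nr)
{
	if (syscall_nr < 0 || (size_t)syscall_nr >= bitmap_size)
		return false;
	return test_bit_ul((size_t)syscall_nr, bitmap);
}
```

As in seccomp_run_filters(), a true result would short-circuit to SECCOMP_RET_ALLOW, and a false result simply falls through to the normal filter walk.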
* Re: [PATCH v3 seccomp 3/5] seccomp/cache: Lookup syscall allowlist for fast path 2020-09-30 15:19 ` [PATCH v3 seccomp 3/5] seccomp/cache: Lookup syscall allowlist for fast path YiFei Zhu @ 2020-09-30 21:32 ` Kees Cook 2020-10-09 0:17 ` YiFei Zhu 0 siblings, 1 reply; 149+ messages in thread From: Kees Cook @ 2020-09-30 21:32 UTC (permalink / raw) To: YiFei Zhu Cc: containers, YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Wed, Sep 30, 2020 at 10:19:14AM -0500, YiFei Zhu wrote: > From: YiFei Zhu <yifeifz2@illinois.edu> > > The fast (common) path for seccomp should be that the filter permits > the syscall to pass through, and failing seccomp is expected to be > an exceptional case; it is not expected for userspace to call a > denylisted syscall over and over. > > This first finds the current allow bitmask by iterating through > syscall_arches[] array and comparing it to the one in struct > seccomp_data; this loop is expected to be unrolled. It then > does a test_bit against the bitmask. If the bit is set, then > there is no need to run the full filter; it returns > SECCOMP_RET_ALLOW immediately. > > Co-developed-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu> > Signed-off-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu> > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> I'd like the content/ordering of this and the emulator patch to be reorganized a bit. I'd like to see the infrastructure of the cache added first (along with the "always allow" test logic in this patch), with the emulator missing: i.e. the patch is a logical no-op: no behavior changes because nothing ever changes the cache bits, but all the operational logic, structure changes, etc, is in place. Then the next patch would be replacing the no-op with the emulator. 
> --- > kernel/seccomp.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 52 insertions(+) > > diff --git a/kernel/seccomp.c b/kernel/seccomp.c > index f09c9e74ae05..bed3b2a7f6c8 100644 > --- a/kernel/seccomp.c > +++ b/kernel/seccomp.c > @@ -172,6 +172,12 @@ struct seccomp_cache_filter_data { }; > static inline void seccomp_cache_prepare(struct seccomp_filter *sfilter) > { > } > + > +static inline bool seccomp_cache_check(const struct seccomp_filter *sfilter, bikeshedding: "cache check" doesn't tell me anything about what it's actually checking for. How about calling this seccomp_is_constant_allow() or something that reflects both the "bool" return ("is") and what that bool means ("should always be allowed"). > + const struct seccomp_data *sd) > +{ > + return false; > +} > #endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */ > > /** > @@ -331,6 +337,49 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen) > return 0; > } > > +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY > +static bool seccomp_cache_check_bitmap(const void *bitmap, size_t bitmap_size, Please also mark as "inline". > + int syscall_nr) > +{ > + if (unlikely(syscall_nr < 0 || syscall_nr >= bitmap_size)) > + return false; > + syscall_nr = array_index_nospec(syscall_nr, bitmap_size); > + > + return test_bit(syscall_nr, bitmap); > +} > + > +/** > + * seccomp_cache_check - lookup seccomp cache > + * @sfilter: The seccomp filter > + * @sd: The seccomp data to lookup the cache with > + * > + * Returns true if the seccomp_data is cached and allowed. > + */ > +static bool seccomp_cache_check(const struct seccomp_filter *sfilter, inline too. 
> + const struct seccomp_data *sd) > +{ > + int syscall_nr = sd->nr; > + const struct seccomp_cache_filter_data *cache = &sfilter->cache; > + > +#ifdef SECCOMP_ARCH_DEFAULT > + if (likely(sd->arch == SECCOMP_ARCH_DEFAULT)) > + return seccomp_cache_check_bitmap(cache->syscall_allow_default, > + SECCOMP_ARCH_DEFAULT_NR, > + syscall_nr); > +#endif /* SECCOMP_ARCH_DEFAULT */ > + > +#ifdef SECCOMP_ARCH_COMPAT > + if (likely(sd->arch == SECCOMP_ARCH_COMPAT)) > + return seccomp_cache_check_bitmap(cache->syscall_allow_compat, > + SECCOMP_ARCH_COMPAT_NR, > + syscall_nr); > +#endif /* SECCOMP_ARCH_COMPAT */ > + > + WARN_ON_ONCE(true); > + return false; > +} > +#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */ > + > /** > * seccomp_run_filters - evaluates all seccomp filters against @sd > * @sd: optional seccomp data to be passed to filters > @@ -353,6 +402,9 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd, > if (WARN_ON(f == NULL)) > return SECCOMP_RET_KILL_PROCESS; > > + if (seccomp_cache_check(f, sd)) > + return SECCOMP_RET_ALLOW; > + > /* > * All filters in the list are evaluated and the lowest BPF return > * value always takes priority (ignoring the DATA). > -- > 2.28.0 > Otherwise, yup, looks good. -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v3 seccomp 3/5] seccomp/cache: Lookup syscall allowlist for fast path 2020-09-30 21:32 ` Kees Cook @ 2020-10-09 0:17 ` YiFei Zhu 2020-10-09 5:35 ` Kees Cook 0 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-10-09 0:17 UTC (permalink / raw) To: Kees Cook Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Wed, Sep 30, 2020 at 4:32 PM Kees Cook <keescook@chromium.org> wrote: > > On Wed, Sep 30, 2020 at 10:19:14AM -0500, YiFei Zhu wrote: > > From: YiFei Zhu <yifeifz2@illinois.edu> > > > > The fast (common) path for seccomp should be that the filter permits > > the syscall to pass through, and failing seccomp is expected to be > > an exceptional case; it is not expected for userspace to call a > > denylisted syscall over and over. > > > > This first finds the current allow bitmask by iterating through > > syscall_arches[] array and comparing it to the one in struct > > seccomp_data; this loop is expected to be unrolled. It then > > does a test_bit against the bitmask. If the bit is set, then > > there is no need to run the full filter; it returns > > SECCOMP_RET_ALLOW immediately. > > > > Co-developed-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu> > > Signed-off-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu> > > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> > > I'd like the content/ordering of this and the emulator patch to be reorganized a bit. > I'd like to see the infrastructure of the cache added first (along with > the "always allow" test logic in this patch), with the emulator missing: > i.e. the patch is a logical no-op: no behavior changes because nothing > ever changes the cache bits, but all the operational logic, structure > changes, etc, is in place. 
Then the next patch would be replacing the > no-op with the emulator. > > > --- > > kernel/seccomp.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++++ > > 1 file changed, 52 insertions(+) > > > > diff --git a/kernel/seccomp.c b/kernel/seccomp.c > > index f09c9e74ae05..bed3b2a7f6c8 100644 > > --- a/kernel/seccomp.c > > +++ b/kernel/seccomp.c > > @@ -172,6 +172,12 @@ struct seccomp_cache_filter_data { }; > > static inline void seccomp_cache_prepare(struct seccomp_filter *sfilter) > > { > > } > > + > > +static inline bool seccomp_cache_check(const struct seccomp_filter *sfilter, > > bikeshedding: "cache check" doesn't tell me anything about what it's > actually checking for. How about calling this seccomp_is_constant_allow() or > something that reflects both the "bool" return ("is") and what that bool > means ("should always be allowed"). We have a naming conflict here. I'm about to rename seccomp_emu_is_const_allow to seccomp_is_const_allow. Adding another seccomp_is_constant_allow is confusing. Suggestions? I think I would prefer to change seccomp_cache_check to seccomp_cache_check_allow. While in this patch set seccomp_cache_check does imply the filter is "constant" allow, argument-processing cache may change this, and specifying an "allow" in the name specifies the 'what that bool means ("should always be allowed")'. YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v3 seccomp 3/5] seccomp/cache: Lookup syscall allowlist for fast path 2020-10-09 0:17 ` YiFei Zhu @ 2020-10-09 5:35 ` Kees Cook 0 siblings, 0 replies; 149+ messages in thread From: Kees Cook @ 2020-10-09 5:35 UTC (permalink / raw) To: YiFei Zhu Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Thu, Oct 08, 2020 at 07:17:39PM -0500, YiFei Zhu wrote: > On Wed, Sep 30, 2020 at 4:32 PM Kees Cook <keescook@chromium.org> wrote: > > > > On Wed, Sep 30, 2020 at 10:19:14AM -0500, YiFei Zhu wrote: > > > From: YiFei Zhu <yifeifz2@illinois.edu> > > > > > > The fast (common) path for seccomp should be that the filter permits > > > the syscall to pass through, and failing seccomp is expected to be > > > an exceptional case; it is not expected for userspace to call a > > > denylisted syscall over and over. > > > > > > This first finds the current allow bitmask by iterating through > > > syscall_arches[] array and comparing it to the one in struct > > > seccomp_data; this loop is expected to be unrolled. It then > > > does a test_bit against the bitmask. If the bit is set, then > > > there is no need to run the full filter; it returns > > > SECCOMP_RET_ALLOW immediately. > > > > > > Co-developed-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu> > > > Signed-off-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu> > > > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> > > > > I'd like the content/ordering of this and the emulator patch to be reorganized a bit. > > I'd like to see the infrastructure of the cache added first (along with > > the "always allow" test logic in this patch), with the emulator missing: > > i.e. 
the patch is a logical no-op: no behavior changes because nothing > > ever changes the cache bits, but all the operational logic, structure > > changes, etc, is in place. Then the next patch would be replacing the > > no-op with the emulator. > > > > > --- > > > kernel/seccomp.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++++ > > > 1 file changed, 52 insertions(+) > > > > > > diff --git a/kernel/seccomp.c b/kernel/seccomp.c > > > index f09c9e74ae05..bed3b2a7f6c8 100644 > > > --- a/kernel/seccomp.c > > > +++ b/kernel/seccomp.c > > > @@ -172,6 +172,12 @@ struct seccomp_cache_filter_data { }; > > > static inline void seccomp_cache_prepare(struct seccomp_filter *sfilter) > > > { > > > } > > > + > > > +static inline bool seccomp_cache_check(const struct seccomp_filter *sfilter, > > > > bikeshedding: "cache check" doesn't tell me anything about what it's > > actually checking for. How about calling this seccomp_is_constant_allow() or > > something that reflects both the "bool" return ("is") and what that bool > > means ("should always be allowed"). > > We have a naming conflict here. I'm about to rename > seccomp_emu_is_const_allow to seccomp_is_const_allow. Adding another > seccomp_is_constant_allow is confusing. Suggestions? > > I think I would prefer to change seccomp_cache_check to > seccomp_cache_check_allow. While in this patch set seccomp_cache_check > does imply the filter is "constant" allow, argument-processing cache > may change this, and specifying an "allow" in the name specifies the > 'what that bool means ("should always be allowed")'. Yeah, that seems good. -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
* [PATCH v3 seccomp 4/5] selftests/seccomp: Compare bitmap vs filter overhead 2020-09-30 15:19 ` [PATCH v3 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu ` (2 preceding siblings ...) 2020-09-30 15:19 ` [PATCH v3 seccomp 3/5] seccomp/cache: Lookup syscall allowlist for fast path YiFei Zhu @ 2020-09-30 15:19 ` YiFei Zhu 2020-09-30 15:19 ` [PATCH v3 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu 2020-10-09 17:14 ` [PATCH v4 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu 5 siblings, 0 replies; 149+ messages in thread From: YiFei Zhu @ 2020-09-30 15:19 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: Kees Cook <keescook@chromium.org> As part of the seccomp benchmarking, include the expectations with regard to the timing behavior of the constant action bitmaps, and report inconsistencies better. Example output with constant action bitmaps on x86: $ sudo ./seccomp_benchmark 100000000 Current BPF sysctl settings: net.core.bpf_jit_enable = 1 net.core.bpf_jit_harden = 0 Benchmarking 200000000 syscalls... 
129.359381409 - 0.008724424 = 129350656985 (129.4s) getpid native: 646 ns 264.385890006 - 129.360453229 = 135025436777 (135.0s) getpid RET_ALLOW 1 filter (bitmap): 675 ns 399.400511893 - 264.387045901 = 135013465992 (135.0s) getpid RET_ALLOW 2 filters (bitmap): 675 ns 545.872866260 - 399.401718327 = 146471147933 (146.5s) getpid RET_ALLOW 3 filters (full): 732 ns 696.337101319 - 545.874097681 = 150463003638 (150.5s) getpid RET_ALLOW 4 filters (full): 752 ns Estimated total seccomp overhead for 1 bitmapped filter: 29 ns Estimated total seccomp overhead for 2 bitmapped filters: 29 ns Estimated total seccomp overhead for 3 full filters: 86 ns Estimated total seccomp overhead for 4 full filters: 106 ns Estimated seccomp entry overhead: 29 ns Estimated seccomp per-filter overhead (last 2 diff): 20 ns Estimated seccomp per-filter overhead (filters / 4): 19 ns Expectations: native ≤ 1 bitmap (646 ≤ 675): ✔️ native ≤ 1 filter (646 ≤ 732): ✔️ per-filter (last 2 diff) ≈ per-filter (filters / 4) (20 ≈ 19): ✔️ 1 bitmapped ≈ 2 bitmapped (29 ≈ 29): ✔️ entry ≈ 1 bitmapped (29 ≈ 29): ✔️ entry ≈ 2 bitmapped (29 ≈ 29): ✔️ native + entry + (per filter * 4) ≈ 4 filters total (755 ≈ 752): ✔️ Signed-off-by: Kees Cook <keescook@chromium.org> [YiFei: Changed commit message to show stats for this patch series] Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> --- .../selftests/seccomp/seccomp_benchmark.c | 151 +++++++++++++++--- tools/testing/selftests/seccomp/settings | 2 +- 2 files changed, 130 insertions(+), 23 deletions(-) diff --git a/tools/testing/selftests/seccomp/seccomp_benchmark.c b/tools/testing/selftests/seccomp/seccomp_benchmark.c index 91f5a89cadac..fcc806585266 100644 --- a/tools/testing/selftests/seccomp/seccomp_benchmark.c +++ b/tools/testing/selftests/seccomp/seccomp_benchmark.c @@ -4,12 +4,16 @@ */ #define _GNU_SOURCE #include <assert.h> +#include <limits.h> +#include <stdbool.h> +#include <stddef.h> #include <stdio.h> #include <stdlib.h> #include <time.h> #include 
<unistd.h> #include <linux/filter.h> #include <linux/seccomp.h> +#include <sys/param.h> #include <sys/prctl.h> #include <sys/syscall.h> #include <sys/types.h> @@ -70,18 +74,74 @@ unsigned long long calibrate(void) return samples * seconds; } +bool approx(int i_one, int i_two) +{ + double one = i_one, one_bump = one * 0.01; + double two = i_two, two_bump = two * 0.01; + + one_bump = one + MAX(one_bump, 2.0); + two_bump = two + MAX(two_bump, 2.0); + + /* Equal to, or within 1% or 2 digits */ + if (one == two || + (one > two && one <= two_bump) || + (two > one && two <= one_bump)) + return true; + return false; +} + +bool le(int i_one, int i_two) +{ + if (i_one <= i_two) + return true; + return false; +} + +long compare(const char *name_one, const char *name_eval, const char *name_two, + unsigned long long one, bool (*eval)(int, int), unsigned long long two) +{ + bool good; + + printf("\t%s %s %s (%lld %s %lld): ", name_one, name_eval, name_two, + (long long)one, name_eval, (long long)two); + if (one > INT_MAX) { + printf("Miscalculation! Measurement went negative: %lld\n", (long long)one); + return 1; + } + if (two > INT_MAX) { + printf("Miscalculation! Measurement went negative: %lld\n", (long long)two); + return 1; + } + + good = eval(one, two); + printf("%s\n", good ? "✔️" : "❌"); + + return good ? 
0 : 1; +} + int main(int argc, char *argv[]) { + struct sock_filter bitmap_filter[] = { + BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, nr)), + BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW), + }; + struct sock_fprog bitmap_prog = { + .len = (unsigned short)ARRAY_SIZE(bitmap_filter), + .filter = bitmap_filter, + }; struct sock_filter filter[] = { + BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, args[0])), BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW), }; struct sock_fprog prog = { .len = (unsigned short)ARRAY_SIZE(filter), .filter = filter, }; - long ret; - unsigned long long samples; - unsigned long long native, filter1, filter2; + + long ret, bits; + unsigned long long samples, calc; + unsigned long long native, filter1, filter2, bitmap1, bitmap2; + unsigned long long entry, per_filter1, per_filter2; printf("Current BPF sysctl settings:\n"); system("sysctl net.core.bpf_jit_enable"); @@ -101,35 +161,82 @@ int main(int argc, char *argv[]) ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0); assert(ret == 0); - /* One filter */ - ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog); + /* One filter resulting in a bitmap */ + ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bitmap_prog); assert(ret == 0); - filter1 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples; - printf("getpid RET_ALLOW 1 filter: %llu ns\n", filter1); + bitmap1 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples; + printf("getpid RET_ALLOW 1 filter (bitmap): %llu ns\n", bitmap1); + + /* Second filter resulting in a bitmap */ + ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bitmap_prog); + assert(ret == 0); - if (filter1 == native) - printf("No overhead measured!? 
Try running again with more samples.\n"); + bitmap2 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples; + printf("getpid RET_ALLOW 2 filters (bitmap): %llu ns\n", bitmap2); - /* Two filters */ + /* Third filter, can no longer be converted to bitmap */ ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog); assert(ret == 0); - filter2 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples; - printf("getpid RET_ALLOW 2 filters: %llu ns\n", filter2); - - /* Calculations */ - printf("Estimated total seccomp overhead for 1 filter: %llu ns\n", - filter1 - native); + filter1 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples; + printf("getpid RET_ALLOW 3 filters (full): %llu ns\n", filter1); - printf("Estimated total seccomp overhead for 2 filters: %llu ns\n", - filter2 - native); + /* Fourth filter, can not be converted to bitmap because of filter 3 */ + ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bitmap_prog); + assert(ret == 0); - printf("Estimated seccomp per-filter overhead: %llu ns\n", - filter2 - filter1); + filter2 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples; + printf("getpid RET_ALLOW 4 filters (full): %llu ns\n", filter2); + + /* Estimations */ +#define ESTIMATE(fmt, var, what) do { \ + var = (what); \ + printf("Estimated " fmt ": %llu ns\n", var); \ + if (var > INT_MAX) \ + goto more_samples; \ + } while (0) + + ESTIMATE("total seccomp overhead for 1 bitmapped filter", calc, + bitmap1 - native); + ESTIMATE("total seccomp overhead for 2 bitmapped filters", calc, + bitmap2 - native); + ESTIMATE("total seccomp overhead for 3 full filters", calc, + filter1 - native); + ESTIMATE("total seccomp overhead for 4 full filters", calc, + filter2 - native); + ESTIMATE("seccomp entry overhead", entry, + bitmap1 - native - (bitmap2 - bitmap1)); + ESTIMATE("seccomp per-filter overhead (last 2 diff)", per_filter1, + filter2 - filter1); + ESTIMATE("seccomp per-filter overhead (filters / 4)", per_filter2, + (filter2 - native - entry) / 4); + + 
printf("Expectations:\n"); + ret |= compare("native", "≤", "1 bitmap", native, le, bitmap1); + bits = compare("native", "≤", "1 filter", native, le, filter1); + if (bits) + goto more_samples; + + ret |= compare("per-filter (last 2 diff)", "≈", "per-filter (filters / 4)", + per_filter1, approx, per_filter2); + + bits = compare("1 bitmapped", "≈", "2 bitmapped", + bitmap1 - native, approx, bitmap2 - native); + if (bits) { + printf("Skipping constant action bitmap expectations: they appear unsupported.\n"); + goto out; + } - printf("Estimated seccomp entry overhead: %llu ns\n", - filter1 - native - (filter2 - filter1)); + ret |= compare("entry", "≈", "1 bitmapped", entry, approx, bitmap1 - native); + ret |= compare("entry", "≈", "2 bitmapped", entry, approx, bitmap2 - native); + ret |= compare("native + entry + (per filter * 4)", "≈", "4 filters total", + entry + (per_filter1 * 4) + native, approx, filter2); + if (ret == 0) + goto out; +more_samples: + printf("Saw unexpected benchmark result. Try running again with more samples?\n"); +out: return 0; } diff --git a/tools/testing/selftests/seccomp/settings b/tools/testing/selftests/seccomp/settings index ba4d85f74cd6..6091b45d226b 100644 --- a/tools/testing/selftests/seccomp/settings +++ b/tools/testing/selftests/seccomp/settings @@ -1 +1 @@ -timeout=90 +timeout=120 -- 2.28.0 ^ permalink raw reply related [flat|nested] 149+ messages in thread
* [PATCH v3 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-09-30 15:19 ` [PATCH v3 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu ` (3 preceding siblings ...) 2020-09-30 15:19 ` [PATCH v3 seccomp 4/5] selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu @ 2020-09-30 15:19 ` YiFei Zhu 2020-09-30 22:00 ` Jann Horn 2020-09-30 22:59 ` Kees Cook 2020-10-09 17:14 ` [PATCH v4 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu 5 siblings, 2 replies; 149+ messages in thread From: YiFei Zhu @ 2020-09-30 15:19 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: YiFei Zhu <yifeifz2@illinois.edu> Currently the kernel does not provide an infrastructure to translate architecture numbers to a human-readable name. Translating syscall numbers to syscall names is possible through FTRACE_SYSCALL infrastructure but it does not provide support for compat syscalls. This will create a file for each PID as /proc/pid/seccomp_cache. The file will be empty when no seccomp filters are loaded, or be in the format of: <arch name> <decimal syscall number> <ALLOW | FILTER> where ALLOW means the cache is guaranteed to allow the syscall, and filter means the cache will pass the syscall to the BPF filter. For the docker default profile on x86_64 it looks like: x86_64 0 ALLOW x86_64 1 ALLOW x86_64 2 ALLOW x86_64 3 ALLOW [...] x86_64 132 ALLOW x86_64 133 ALLOW x86_64 134 FILTER x86_64 135 FILTER x86_64 136 FILTER x86_64 137 ALLOW x86_64 138 ALLOW x86_64 139 FILTER x86_64 140 ALLOW x86_64 141 ALLOW [...] 
This file is guarded by CONFIG_DEBUG_SECCOMP_CACHE with a default of N because I think certain users of seccomp might not want the application to know which syscalls are definitely usable. For the same reason, it is also guarded by CAP_SYS_ADMIN. Suggested-by: Jann Horn <jannh@google.com> Link: https://lore.kernel.org/lkml/CAG48ez3Ofqp4crXGksLmZY6=fGrF_tWyUCg7PBkAetvbbOPeOA@mail.gmail.com/ Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> --- arch/Kconfig | 15 +++++++++++ arch/x86/include/asm/seccomp.h | 3 +++ fs/proc/base.c | 3 +++ include/linux/seccomp.h | 5 ++++ kernel/seccomp.c | 46 ++++++++++++++++++++++++++++++++++ 5 files changed, 72 insertions(+) diff --git a/arch/Kconfig b/arch/Kconfig index ca867b2a5d71..b840cadcc882 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -478,6 +478,7 @@ config HAVE_ARCH_SECCOMP_CACHE_NR_ONLY - all the requirements for HAVE_ARCH_SECCOMP_FILTER - SECCOMP_ARCH_DEFAULT - SECCOMP_ARCH_DEFAULT_NR + - SECCOMP_ARCH_DEFAULT_NAME config SECCOMP prompt "Enable seccomp to safely execute untrusted bytecode" @@ -532,6 +533,20 @@ config SECCOMP_CACHE_NR_ONLY endchoice +config DEBUG_SECCOMP_CACHE + bool "Show seccomp filter cache status in /proc/pid/seccomp_cache" + depends on SECCOMP_CACHE_NR_ONLY + depends on PROC_FS + help + This is enables /proc/pid/seccomp_cache interface to monitor + seccomp cache data. The file format is subject to change. Reading + the file requires CAP_SYS_ADMIN. + + This option is for debugging only. Enabling present the risk that + an adversary may be able to infer the seccomp filter logic. + + If unsure, say N. 
+ config HAVE_ARCH_STACKLEAK bool help diff --git a/arch/x86/include/asm/seccomp.h b/arch/x86/include/asm/seccomp.h index 7b3a58271656..33ccc074be7a 100644 --- a/arch/x86/include/asm/seccomp.h +++ b/arch/x86/include/asm/seccomp.h @@ -19,13 +19,16 @@ #ifdef CONFIG_X86_64 # define SECCOMP_ARCH_DEFAULT AUDIT_ARCH_X86_64 # define SECCOMP_ARCH_DEFAULT_NR NR_syscalls +# define SECCOMP_ARCH_DEFAULT_NAME "x86_64" # ifdef CONFIG_COMPAT # define SECCOMP_ARCH_COMPAT AUDIT_ARCH_I386 # define SECCOMP_ARCH_COMPAT_NR IA32_NR_syscalls +# define SECCOMP_ARCH_COMPAT_NAME "x86_32" # endif #else /* !CONFIG_X86_64 */ # define SECCOMP_ARCH_DEFAULT AUDIT_ARCH_I386 # define SECCOMP_ARCH_DEFAULT_NR NR_syscalls +# define SECCOMP_ARCH_COMPAT_NAME "x86_32" #endif #include <asm-generic/seccomp.h> diff --git a/fs/proc/base.c b/fs/proc/base.c index 617db4e0faa0..c60c5fce70fa 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -3258,6 +3258,9 @@ static const struct pid_entry tgid_base_stuff[] = { #ifdef CONFIG_PROC_PID_ARCH_STATUS ONE("arch_status", S_IRUGO, proc_pid_arch_status), #endif +#ifdef CONFIG_DEBUG_SECCOMP_CACHE + ONE("seccomp_cache", S_IRUSR, proc_pid_seccomp_cache), +#endif }; static int proc_tgid_base_readdir(struct file *file, struct dir_context *ctx) diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h index 02aef2844c38..c35430f5f553 100644 --- a/include/linux/seccomp.h +++ b/include/linux/seccomp.h @@ -121,4 +121,9 @@ static inline long seccomp_get_metadata(struct task_struct *task, return -EINVAL; } #endif /* CONFIG_SECCOMP_FILTER && CONFIG_CHECKPOINT_RESTORE */ + +#ifdef CONFIG_DEBUG_SECCOMP_CACHE +int proc_pid_seccomp_cache(struct seq_file *m, struct pid_namespace *ns, + struct pid *pid, struct task_struct *task); +#endif #endif /* _LINUX_SECCOMP_H */ diff --git a/kernel/seccomp.c b/kernel/seccomp.c index bed3b2a7f6c8..c5ca5e30281b 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -2297,3 +2297,49 @@ static int __init seccomp_sysctl_init(void) 
device_initcall(seccomp_sysctl_init) #endif /* CONFIG_SYSCTL */ + +#ifdef CONFIG_DEBUG_SECCOMP_CACHE +/* Currently CONFIG_DEBUG_SECCOMP_CACHE implies CONFIG_SECCOMP_CACHE_NR_ONLY */ +static void proc_pid_seccomp_cache_arch(struct seq_file *m, const char *name, + const void *bitmap, size_t bitmap_size) +{ + int nr; + + for (nr = 0; nr < bitmap_size; nr++) { + bool cached = test_bit(nr, bitmap); + char *status = cached ? "ALLOW" : "FILTER"; + + seq_printf(m, "%s %d %s\n", name, nr, status); + } +} + +int proc_pid_seccomp_cache(struct seq_file *m, struct pid_namespace *ns, + struct pid *pid, struct task_struct *task) +{ + struct seccomp_filter *f; + + /* + * We don't want some sandboxed process know what their seccomp + * filters consist of. + */ + if (!file_ns_capable(m->file, &init_user_ns, CAP_SYS_ADMIN)) + return -EACCES; + + f = READ_ONCE(task->seccomp.filter); + if (!f) + return 0; + +#ifdef SECCOMP_ARCH_DEFAULT + proc_pid_seccomp_cache_arch(m, SECCOMP_ARCH_DEFAULT_NAME, + f->cache.syscall_allow_default, + SECCOMP_ARCH_DEFAULT_NR); +#endif /* SECCOMP_ARCH_DEFAULT */ + +#ifdef SECCOMP_ARCH_COMPAT + proc_pid_seccomp_cache_arch(m, SECCOMP_ARCH_COMPAT_NAME, + f->cache.syscall_allow_compat, + SECCOMP_ARCH_COMPAT_NR); +#endif /* SECCOMP_ARCH_COMPAT */ + return 0; +} +#endif /* CONFIG_DEBUG_SECCOMP_CACHE */ -- 2.28.0 ^ permalink raw reply related [flat|nested] 149+ messages in thread
* Re: [PATCH v3 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-09-30 15:19 ` [PATCH v3 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu @ 2020-09-30 22:00 ` Jann Horn 2020-09-30 23:12 ` Kees Cook 2020-10-01 12:06 ` YiFei Zhu 2020-09-30 22:59 ` Kees Cook 1 sibling, 2 replies; 149+ messages in thread From: Jann Horn @ 2020-09-30 22:00 UTC (permalink / raw) To: YiFei Zhu Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Wed, Sep 30, 2020 at 5:20 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > Currently the kernel does not provide an infrastructure to translate > architecture numbers to a human-readable name. Translating syscall > numbers to syscall names is possible through FTRACE_SYSCALL > infrastructure but it does not provide support for compat syscalls. > > This will create a file for each PID as /proc/pid/seccomp_cache. > The file will be empty when no seccomp filters are loaded, or be > in the format of: > <arch name> <decimal syscall number> <ALLOW | FILTER> > where ALLOW means the cache is guaranteed to allow the syscall, > and filter means the cache will pass the syscall to the BPF filter. > > For the docker default profile on x86_64 it looks like: > x86_64 0 ALLOW > x86_64 1 ALLOW > x86_64 2 ALLOW > x86_64 3 ALLOW > [...] > x86_64 132 ALLOW > x86_64 133 ALLOW > x86_64 134 FILTER > x86_64 135 FILTER > x86_64 136 FILTER > x86_64 137 ALLOW > x86_64 138 ALLOW > x86_64 139 FILTER > x86_64 140 ALLOW > x86_64 141 ALLOW > [...] Oooh, neat! :) Thanks! 
> Suggested-by: Jann Horn <jannh@google.com> > Link: https://lore.kernel.org/lkml/CAG48ez3Ofqp4crXGksLmZY6=fGrF_tWyUCg7PBkAetvbbOPeOA@mail.gmail.com/ > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> > --- > arch/Kconfig | 15 +++++++++++ > arch/x86/include/asm/seccomp.h | 3 +++ > fs/proc/base.c | 3 +++ > include/linux/seccomp.h | 5 ++++ > kernel/seccomp.c | 46 ++++++++++++++++++++++++++++++++++ > 5 files changed, 72 insertions(+) > > diff --git a/arch/Kconfig b/arch/Kconfig > index ca867b2a5d71..b840cadcc882 100644 > --- a/arch/Kconfig > +++ b/arch/Kconfig > @@ -478,6 +478,7 @@ config HAVE_ARCH_SECCOMP_CACHE_NR_ONLY > - all the requirements for HAVE_ARCH_SECCOMP_FILTER > - SECCOMP_ARCH_DEFAULT > - SECCOMP_ARCH_DEFAULT_NR > + - SECCOMP_ARCH_DEFAULT_NAME > > config SECCOMP > prompt "Enable seccomp to safely execute untrusted bytecode" > @@ -532,6 +533,20 @@ config SECCOMP_CACHE_NR_ONLY > > endchoice > > +config DEBUG_SECCOMP_CACHE > + bool "Show seccomp filter cache status in /proc/pid/seccomp_cache" > + depends on SECCOMP_CACHE_NR_ONLY > + depends on PROC_FS > + help > + This is enables /proc/pid/seccomp_cache interface to monitor nit: s/is enables/enables/ > + seccomp cache data. The file format is subject to change. Reading > + the file requires CAP_SYS_ADMIN. > + > + This option is for debugging only. Enabling present the risk that > + an adversary may be able to infer the seccomp filter logic. > + > + If unsure, say N. > + [...] > diff --git a/kernel/seccomp.c b/kernel/seccomp.c [...] > +int proc_pid_seccomp_cache(struct seq_file *m, struct pid_namespace *ns, > + struct pid *pid, struct task_struct *task) > +{ > + struct seccomp_filter *f; > + > + /* > + * We don't want some sandboxed process know what their seccomp > + * filters consist of. 
> + */ > + if (!file_ns_capable(m->file, &init_user_ns, CAP_SYS_ADMIN)) > + return -EACCES; > + > + f = READ_ONCE(task->seccomp.filter); > + if (!f) > + return 0; Hmm, this won't work, because the task could be exiting, and seccomp filters are detached in release_task() (using seccomp_filter_release()). And at the moment, seccomp_filter_release() just locklessly NULLs out the tsk->seccomp.filter pointer and drops the reference. The locking here is kind of gross, but basically I think you can change this code to use lock_task_sighand() / unlock_task_sighand() (see the other examples in fs/proc/base.c), and bail out if lock_task_sighand() returns NULL. And in seccomp_filter_release(), add something like this: /* We are effectively holding the siglock by not having any sighand. */ WARN_ON(tsk->sighand != NULL); > +#ifdef SECCOMP_ARCH_DEFAULT > + proc_pid_seccomp_cache_arch(m, SECCOMP_ARCH_DEFAULT_NAME, > + f->cache.syscall_allow_default, > + SECCOMP_ARCH_DEFAULT_NR); > +#endif /* SECCOMP_ARCH_DEFAULT */ > + > +#ifdef SECCOMP_ARCH_COMPAT > + proc_pid_seccomp_cache_arch(m, SECCOMP_ARCH_COMPAT_NAME, > + f->cache.syscall_allow_compat, > + SECCOMP_ARCH_COMPAT_NR); > +#endif /* SECCOMP_ARCH_COMPAT */ > + return 0; > +} > +#endif /* CONFIG_DEBUG_SECCOMP_CACHE */ > -- > 2.28.0 > ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v3 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-09-30 22:00 ` Jann Horn @ 2020-09-30 23:12 ` Kees Cook 2020-10-01 12:06 ` YiFei Zhu 1 sibling, 0 replies; 149+ messages in thread From: Kees Cook @ 2020-09-30 23:12 UTC (permalink / raw) To: Jann Horn Cc: YiFei Zhu, Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Thu, Oct 01, 2020 at 12:00:46AM +0200, Jann Horn wrote: > On Wed, Sep 30, 2020 at 5:20 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > [...] > > diff --git a/kernel/seccomp.c b/kernel/seccomp.c > [...] > > +int proc_pid_seccomp_cache(struct seq_file *m, struct pid_namespace *ns, > > + struct pid *pid, struct task_struct *task) > > +{ > > + struct seccomp_filter *f; > > + > > + /* > > + * We don't want some sandboxed process know what their seccomp > > + * filters consist of. > > + */ > > + if (!file_ns_capable(m->file, &init_user_ns, CAP_SYS_ADMIN)) > > + return -EACCES; > > + > > + f = READ_ONCE(task->seccomp.filter); > > + if (!f) > > + return 0; > > Hmm, this won't work, because the task could be exiting, and seccomp > filters are detached in release_task() (using > seccomp_filter_release()). And at the moment, seccomp_filter_release() > just locklessly NULLs out the tsk->seccomp.filter pointer and drops > the reference. Oh nice catch. Yeah, this would only happen if it was the only filter remaining on a process with no children, etc. > > The locking here is kind of gross, but basically I think you can > change this code to use lock_task_sighand() / unlock_task_sighand() > (see the other examples in fs/proc/base.c), and bail out if > lock_task_sighand() returns NULL. 
And in seccomp_filter_release(), add > something like this: > > /* We are effectively holding the siglock by not having any sighand. */ > WARN_ON(tsk->sighand != NULL); Yeah, good idea. -- Kees Cook
* Re: [PATCH v3 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-09-30 22:00 ` Jann Horn 2020-09-30 23:12 ` Kees Cook @ 2020-10-01 12:06 ` YiFei Zhu 2020-10-01 16:05 ` Jann Horn 1 sibling, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-10-01 12:06 UTC (permalink / raw) To: Jann Horn Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Wed, Sep 30, 2020 at 5:01 PM Jann Horn <jannh@google.com> wrote: > Hmm, this won't work, because the task could be exiting, and seccomp > filters are detached in release_task() (using > seccomp_filter_release()). And at the moment, seccomp_filter_release() > just locklessly NULLs out the tsk->seccomp.filter pointer and drops > the reference. > > The locking here is kind of gross, but basically I think you can > change this code to use lock_task_sighand() / unlock_task_sighand() > (see the other examples in fs/proc/base.c), and bail out if > lock_task_sighand() returns NULL. And in seccomp_filter_release(), add > something like this: > > /* We are effectively holding the siglock by not having any sighand. */ > WARN_ON(tsk->sighand != NULL); Ah thanks. I was thinking about how tasks exit and get freed and that sort of stuff, and how this would race against them. The last time I worked with procfs there was some magic going on that I could not figure out, so I was thinking if some magic will stop the task_struct from being released, considering it's an argument here. I just looked at release_task and related functions; looks like it will, at the end, decrease the reference count of the task_struct. Does procfs increase the refcount while calling the procfs functions? 
Hence, in procfs functions one can rely on the task_struct still being a valid task_struct, but any direct effects of release_task may happen while the procfs functions are running? YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v3 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-10-01 12:06 ` YiFei Zhu @ 2020-10-01 16:05 ` Jann Horn 2020-10-01 16:18 ` YiFei Zhu 0 siblings, 1 reply; 149+ messages in thread From: Jann Horn @ 2020-10-01 16:05 UTC (permalink / raw) To: YiFei Zhu Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Thu, Oct 1, 2020 at 2:06 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > On Wed, Sep 30, 2020 at 5:01 PM Jann Horn <jannh@google.com> wrote: > > Hmm, this won't work, because the task could be exiting, and seccomp > > filters are detached in release_task() (using > > seccomp_filter_release()). And at the moment, seccomp_filter_release() > > just locklessly NULLs out the tsk->seccomp.filter pointer and drops > > the reference. > > > > The locking here is kind of gross, but basically I think you can > > change this code to use lock_task_sighand() / unlock_task_sighand() > > (see the other examples in fs/proc/base.c), and bail out if > > lock_task_sighand() returns NULL. And in seccomp_filter_release(), add > > something like this: > > > > /* We are effectively holding the siglock by not having any sighand. */ > > WARN_ON(tsk->sighand != NULL); > > Ah thanks. I was thinking about how tasks exit and get freed and that > sort of stuff, and how this would race against them. The last time I > worked with procfs there was some magic going on that I could not > figure out, so I was thinking if some magic will stop the task_struct > from being released, considering it's an argument here. > > I just looked at release_task and related functions; looks like it > will, at the end, decrease the reference count of the task_struct. 
> Does procfs increase the refcount while calling the procfs functions? > Hence, in procfs functions one can rely on the task_struct still being > a task_struct, but any direct effects of release_task may happen while > the procfs functions are running? Yeah. The ONE() entry you're adding to tgid_base_stuff is used to help instantiate a "struct inode" when someone looks up the path "/proc/$tgid/seccomp_cache"; then when that path is opened, a "struct file" is created that holds a reference to the inode; and while that file exists, your proc_pid_seccomp_cache() can be invoked. proc_pid_seccomp_cache() is invoked from proc_single_show() ("PROC_I(inode)->op.proc_show" is proc_pid_seccomp_cache), and proc_single_show() obtains a temporary reference to the task_struct using get_pid_task() on a "struct pid" and drops that reference afterwards with put_task_struct(). The "struct pid" is obtained from the "struct proc_inode", which is essentially a subclass of "struct inode". The "struct pid" is kept referenced until the inode goes away, via proc_pid_evict_inode(), called by proc_evict_inode(). By looking at put_task_struct() and its callees, you can figure out which parts of the "struct task" are kept alive by the reference to it. By the way, maybe it'd make sense to add this to tid_base_stuff as well? That should just be one extra line of code. Seccomp filters are technically per-thread, so it would make sense to have them visible in the per-thread subdirectories /proc/$pid/task/$tid/. ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v3 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-10-01 16:05 ` Jann Horn @ 2020-10-01 16:18 ` YiFei Zhu 0 siblings, 0 replies; 149+ messages in thread From: YiFei Zhu @ 2020-10-01 16:18 UTC (permalink / raw) To: Jann Horn Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Thu, Oct 1, 2020 at 11:05 AM Jann Horn <jannh@google.com> wrote: > Yeah. > > The ONE() entry you're adding to tgid_base_stuff is used to help > instantiate a "struct inode" when someone looks up the path > "/proc/$tgid/seccomp_cache"; then when that path is opened, a "struct > file" is created that holds a reference to the inode; and while that > file exists, your proc_pid_seccomp_cache() can be invoked. > > proc_pid_seccomp_cache() is invoked from proc_single_show() > ("PROC_I(inode)->op.proc_show" is proc_pid_seccomp_cache), and > proc_single_show() obtains a temporary reference to the task_struct > using get_pid_task() on a "struct pid" and drops that reference > afterwards with put_task_struct(). The "struct pid" is obtained from > the "struct proc_inode", which is essentially a subclass of "struct > inode". The "struct pid" is kept refererenced until the inode goes > away, via proc_pid_evict_inode(), called by proc_evict_inode(). > > By looking at put_task_struct() and its callees, you can figure out > which parts of the "struct task" are kept alive by the reference to > it. Ah I see. Thanks for the explanation. > By the way, maybe it'd make sense to add this to tid_base_stuff as > well? That should just be one extra line of code. Seccomp filters are > technically per-thread, so it would make sense to have them visible in > the per-thread subdirectories /proc/$pid/task/$tid/. Right. Will do. 
YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v3 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-09-30 15:19 ` [PATCH v3 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu 2020-09-30 22:00 ` Jann Horn @ 2020-09-30 22:59 ` Kees Cook 2020-09-30 23:08 ` Jann Horn 1 sibling, 1 reply; 149+ messages in thread From: Kees Cook @ 2020-09-30 22:59 UTC (permalink / raw) To: YiFei Zhu Cc: containers, YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Wed, Sep 30, 2020 at 10:19:16AM -0500, YiFei Zhu wrote: > From: YiFei Zhu <yifeifz2@illinois.edu> > > Currently the kernel does not provide an infrastructure to translate > architecture numbers to a human-readable name. Translating syscall > numbers to syscall names is possible through FTRACE_SYSCALL > infrastructure but it does not provide support for compat syscalls. > > This will create a file for each PID as /proc/pid/seccomp_cache. > The file will be empty when no seccomp filters are loaded, or be > in the format of: > <arch name> <decimal syscall number> <ALLOW | FILTER> > where ALLOW means the cache is guaranteed to allow the syscall, > and filter means the cache will pass the syscall to the BPF filter. > > For the docker default profile on x86_64 it looks like: > x86_64 0 ALLOW > x86_64 1 ALLOW > x86_64 2 ALLOW > x86_64 3 ALLOW > [...] > x86_64 132 ALLOW > x86_64 133 ALLOW > x86_64 134 FILTER > x86_64 135 FILTER > x86_64 136 FILTER > x86_64 137 ALLOW > x86_64 138 ALLOW > x86_64 139 FILTER > x86_64 140 ALLOW > x86_64 141 ALLOW > [...] > > This file is guarded by CONFIG_DEBUG_SECCOMP_CACHE with a default > of N because I think certain users of seccomp might not want the > application to know which syscalls are definitely usable. 
For > the same reason, it is also guarded by CAP_SYS_ADMIN. > > Suggested-by: Jann Horn <jannh@google.com> > Link: https://lore.kernel.org/lkml/CAG48ez3Ofqp4crXGksLmZY6=fGrF_tWyUCg7PBkAetvbbOPeOA@mail.gmail.com/ > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> > --- > arch/Kconfig | 15 +++++++++++ > arch/x86/include/asm/seccomp.h | 3 +++ > fs/proc/base.c | 3 +++ > include/linux/seccomp.h | 5 ++++ > kernel/seccomp.c | 46 ++++++++++++++++++++++++++++++++++ > 5 files changed, 72 insertions(+) > > diff --git a/arch/Kconfig b/arch/Kconfig > index ca867b2a5d71..b840cadcc882 100644 > --- a/arch/Kconfig > +++ b/arch/Kconfig > @@ -478,6 +478,7 @@ config HAVE_ARCH_SECCOMP_CACHE_NR_ONLY > - all the requirements for HAVE_ARCH_SECCOMP_FILTER > - SECCOMP_ARCH_DEFAULT > - SECCOMP_ARCH_DEFAULT_NR > + - SECCOMP_ARCH_DEFAULT_NAME > > config SECCOMP > prompt "Enable seccomp to safely execute untrusted bytecode" > @@ -532,6 +533,20 @@ config SECCOMP_CACHE_NR_ONLY > > endchoice > > +config DEBUG_SECCOMP_CACHE naming nit: I prefer where what how order, so SECCOMP_CACHE_DEBUG. > + bool "Show seccomp filter cache status in /proc/pid/seccomp_cache" > + depends on SECCOMP_CACHE_NR_ONLY > + depends on PROC_FS > + help > + This enables the /proc/pid/seccomp_cache interface to monitor > + seccomp cache data. The file format is subject to change. Reading > + the file requires CAP_SYS_ADMIN. > + > + This option is for debugging only. Enabling it presents the risk that > + an adversary may be able to infer the seccomp filter logic. > + > + If unsure, say N. 
> + > config HAVE_ARCH_STACKLEAK > bool > help > diff --git a/arch/x86/include/asm/seccomp.h b/arch/x86/include/asm/seccomp.h > index 7b3a58271656..33ccc074be7a 100644 > --- a/arch/x86/include/asm/seccomp.h > +++ b/arch/x86/include/asm/seccomp.h > @@ -19,13 +19,16 @@ > #ifdef CONFIG_X86_64 > # define SECCOMP_ARCH_DEFAULT AUDIT_ARCH_X86_64 > # define SECCOMP_ARCH_DEFAULT_NR NR_syscalls > +# define SECCOMP_ARCH_DEFAULT_NAME "x86_64" > # ifdef CONFIG_COMPAT > # define SECCOMP_ARCH_COMPAT AUDIT_ARCH_I386 > # define SECCOMP_ARCH_COMPAT_NR IA32_NR_syscalls > +# define SECCOMP_ARCH_COMPAT_NAME "x86_32" I think this should be "ia32"? Is there a good definitive guide on this naming convention? -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v3 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-09-30 22:59 ` Kees Cook @ 2020-09-30 23:08 ` Jann Horn 2020-09-30 23:21 ` Kees Cook 0 siblings, 1 reply; 149+ messages in thread From: Jann Horn @ 2020-09-30 23:08 UTC (permalink / raw) To: Kees Cook Cc: YiFei Zhu, Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry, Thomas Gleixner, Ingo Molnar, Borislav Petkov, the arch/x86 maintainers [adding x86 folks to enhance bikeshedding] On Thu, Oct 1, 2020 at 12:59 AM Kees Cook <keescook@chromium.org> wrote: > On Wed, Sep 30, 2020 at 10:19:16AM -0500, YiFei Zhu wrote: > > From: YiFei Zhu <yifeifz2@illinois.edu> > > > > Currently the kernel does not provide an infrastructure to translate > > architecture numbers to a human-readable name. Translating syscall > > numbers to syscall names is possible through FTRACE_SYSCALL > > infrastructure but it does not provide support for compat syscalls. > > > > This will create a file for each PID as /proc/pid/seccomp_cache. > > The file will be empty when no seccomp filters are loaded, or be > > in the format of: > > <arch name> <decimal syscall number> <ALLOW | FILTER> > > where ALLOW means the cache is guaranteed to allow the syscall, > > and filter means the cache will pass the syscall to the BPF filter. > > > > For the docker default profile on x86_64 it looks like: > > x86_64 0 ALLOW > > x86_64 1 ALLOW > > x86_64 2 ALLOW > > x86_64 3 ALLOW > > [...] > > x86_64 132 ALLOW > > x86_64 133 ALLOW > > x86_64 134 FILTER > > x86_64 135 FILTER > > x86_64 136 FILTER > > x86_64 137 ALLOW > > x86_64 138 ALLOW > > x86_64 139 FILTER > > x86_64 140 ALLOW > > x86_64 141 ALLOW [...] 
> > diff --git a/arch/x86/include/asm/seccomp.h b/arch/x86/include/asm/seccomp.h > > index 7b3a58271656..33ccc074be7a 100644 > > --- a/arch/x86/include/asm/seccomp.h > > +++ b/arch/x86/include/asm/seccomp.h > > @@ -19,13 +19,16 @@ > > #ifdef CONFIG_X86_64 > > # define SECCOMP_ARCH_DEFAULT AUDIT_ARCH_X86_64 > > # define SECCOMP_ARCH_DEFAULT_NR NR_syscalls > > +# define SECCOMP_ARCH_DEFAULT_NAME "x86_64" > > # ifdef CONFIG_COMPAT > > # define SECCOMP_ARCH_COMPAT AUDIT_ARCH_I386 > > # define SECCOMP_ARCH_COMPAT_NR IA32_NR_syscalls > > +# define SECCOMP_ARCH_COMPAT_NAME "x86_32" > > I think this should be "ia32"? Is there a good definitive guide on this > naming convention? "man 2 syscall" calls them "x86-64" and "i386". The syscall table files use ABI names "i386" and "64". The syscall stub prefixes use "x64" and "ia32". I don't think we have a good consistent naming strategy here. :P ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v3 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-09-30 23:08 ` Jann Horn @ 2020-09-30 23:21 ` Kees Cook 0 siblings, 0 replies; 149+ messages in thread From: Kees Cook @ 2020-09-30 23:21 UTC (permalink / raw) To: Jann Horn Cc: YiFei Zhu, Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry, Thomas Gleixner, Ingo Molnar, Borislav Petkov, the arch/x86 maintainers On Thu, Oct 01, 2020 at 01:08:04AM +0200, Jann Horn wrote: > [adding x86 folks to enhance bikeshedding] > > On Thu, Oct 1, 2020 at 12:59 AM Kees Cook <keescook@chromium.org> wrote: > > On Wed, Sep 30, 2020 at 10:19:16AM -0500, YiFei Zhu wrote: > > > From: YiFei Zhu <yifeifz2@illinois.edu> > > > > > > Currently the kernel does not provide an infrastructure to translate > > > architecture numbers to a human-readable name. Translating syscall > > > numbers to syscall names is possible through FTRACE_SYSCALL > > > infrastructure but it does not provide support for compat syscalls. > > > > > > This will create a file for each PID as /proc/pid/seccomp_cache. > > > The file will be empty when no seccomp filters are loaded, or be > > > in the format of: > > > <arch name> <decimal syscall number> <ALLOW | FILTER> > > > where ALLOW means the cache is guaranteed to allow the syscall, > > > and filter means the cache will pass the syscall to the BPF filter. > > > > > > For the docker default profile on x86_64 it looks like: > > > x86_64 0 ALLOW > > > x86_64 1 ALLOW > > > x86_64 2 ALLOW > > > x86_64 3 ALLOW > > > [...] 
> > > x86_64 132 ALLOW > > > x86_64 133 ALLOW > > > x86_64 134 FILTER > > > x86_64 135 FILTER > > > x86_64 136 FILTER > > > x86_64 137 ALLOW > > > x86_64 138 ALLOW > > > x86_64 139 FILTER > > > x86_64 140 ALLOW > > > x86_64 141 ALLOW > [...] > > > diff --git a/arch/x86/include/asm/seccomp.h b/arch/x86/include/asm/seccomp.h > > > index 7b3a58271656..33ccc074be7a 100644 > > > --- a/arch/x86/include/asm/seccomp.h > > > +++ b/arch/x86/include/asm/seccomp.h > > > @@ -19,13 +19,16 @@ > > > #ifdef CONFIG_X86_64 > > > # define SECCOMP_ARCH_DEFAULT AUDIT_ARCH_X86_64 > > > # define SECCOMP_ARCH_DEFAULT_NR NR_syscalls > > > +# define SECCOMP_ARCH_DEFAULT_NAME "x86_64" > > > # ifdef CONFIG_COMPAT > > > # define SECCOMP_ARCH_COMPAT AUDIT_ARCH_I386 > > > # define SECCOMP_ARCH_COMPAT_NR IA32_NR_syscalls > > > +# define SECCOMP_ARCH_COMPAT_NAME "x86_32" > > > > I think this should be "ia32"? Is there a good definitive guide on this > > naming convention? > > "man 2 syscall" calls them "x86-64" and "i386". The syscall table > files use ABI names "i386" and "64". The syscall stub prefixes use > "x64" and "ia32". > > I don't think we have a good consistent naming strategy here. :P Agreed. And with "i386" being so hopelessly inaccurate, I prefer "ia32" ... *shrug* I would hope we don't have to be super-pedantic and call them "x86-64" and "IA-32". :P -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
* [PATCH v4 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results 2020-09-30 15:19 ` [PATCH v3 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu ` (4 preceding siblings ...) 2020-09-30 15:19 ` [PATCH v3 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu @ 2020-10-09 17:14 ` YiFei Zhu 2020-10-09 17:14 ` [PATCH v4 seccomp 1/5] seccomp/cache: Lookup syscall allowlist bitmap for fast path YiFei Zhu ` (5 more replies) 5 siblings, 6 replies; 149+ messages in thread From: YiFei Zhu @ 2020-10-09 17:14 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: YiFei Zhu <yifeifz2@illinois.edu> Alternative: https://lore.kernel.org/lkml/20200923232923.3142503-1-keescook@chromium.org/T/ Major differences from the linked alternative by Kees: * No x32 special-case handling -- not worth the complexity * No caching of denylist -- not worth the complexity * No seccomp arch pinning -- I think this is an independent feature * The bitmaps are part of the filters rather than the task. This series adds a bitmap to cache seccomp filter results if the result permits a syscall and is independent of syscall arguments. This visibly decreases seccomp overhead for most common seccomp filters with very little memory footprint. The overhead of running Seccomp filters has been part of some past discussions [1][2][3]. Oftentimes, the filters have a large number of instructions that check syscall numbers one by one and jump based on that. Some users chain BPF filters which further enlarge the overhead. 
A recent work [6] comprehensively measures the Seccomp overhead and shows that the overhead is non-negligible and has a non-trivial impact on application performance. We observed some common filters, such as docker's [4] or systemd's [5], will make most decisions based only on the syscall numbers, and as past discussions considered, a bitmap where each bit represents a syscall makes most sense for these filters. In order to build this bitmap at filter attach time, each filter is emulated for every syscall (under each possible architecture), and checked for any accesses of struct seccomp_data that are not the "arch" nor "nr" (syscall) members. If only "arch" and "nr" are examined, and the program returns allow, then we can be sure that the filter must return allow independent from syscall arguments. When it is concluded that an allow must occur for the given architecture and syscall pair, seccomp will immediately allow the syscall, bypassing further BPF execution. Ongoing work is to further support arguments with fast hash table lookups. We are investigating the performance of doing so [6], and how to best integrate with the existing seccomp infrastructure. Some benchmarks are performed with results in patch 5, copied below: Current BPF sysctl settings: net.core.bpf_jit_enable = 1 net.core.bpf_jit_harden = 0 Benchmarking 200000000 syscalls... 
129.359381409 - 0.008724424 = 129350656985 (129.4s) getpid native: 646 ns 264.385890006 - 129.360453229 = 135025436777 (135.0s) getpid RET_ALLOW 1 filter (bitmap): 675 ns 399.400511893 - 264.387045901 = 135013465992 (135.0s) getpid RET_ALLOW 2 filters (bitmap): 675 ns 545.872866260 - 399.401718327 = 146471147933 (146.5s) getpid RET_ALLOW 3 filters (full): 732 ns 696.337101319 - 545.874097681 = 150463003638 (150.5s) getpid RET_ALLOW 4 filters (full): 752 ns Estimated total seccomp overhead for 1 bitmapped filter: 29 ns Estimated total seccomp overhead for 2 bitmapped filters: 29 ns Estimated total seccomp overhead for 3 full filters: 86 ns Estimated total seccomp overhead for 4 full filters: 106 ns Estimated seccomp entry overhead: 29 ns Estimated seccomp per-filter overhead (last 2 diff): 20 ns Estimated seccomp per-filter overhead (filters / 4): 19 ns Expectations: native ≤ 1 bitmap (646 ≤ 675): ✔️ native ≤ 1 filter (646 ≤ 732): ✔️ per-filter (last 2 diff) ≈ per-filter (filters / 4) (20 ≈ 19): ✔️ 1 bitmapped ≈ 2 bitmapped (29 ≈ 29): ✔️ entry ≈ 1 bitmapped (29 ≈ 29): ✔️ entry ≈ 2 bitmapped (29 ≈ 29): ✔️ native + entry + (per filter * 4) ≈ 4 filters total (755 ≈ 752): ✔️ v3 -> v4: * Reordered patches * Naming changes * Fixed racing in /proc/pid/seccomp_cache against filter being released from task, using Jann's suggestion of sighand spinlock. * Cache no longer configurable. * Copied some description from cover letter to commit messages. * Used Kees's logic to set clear bits from bitmap, rather than set bits. v2 -> v3: * Added array_index_nospec guards * No more syscall_arches[] array and expecting on loop unrolling. Arches are configured with per-arch seccomp.h. * Moved filter emulation to attach time (from prepare time). * Further simplified emulator, basing on Kees's code. * Guard /proc/pid/seccomp_cache with CAP_SYS_ADMIN. v1 -> v2: * Corrected one outdated function documentation. RFC -> v1: * Config made on by default across all arches that could support it. 
* Added arch numbers array and emulate filter for each arch number, and have a per-arch bitmap. * Massively simplified the emulator so it would only support the common instructions in Kees's list. * Fixed inheriting bitmap across filters (filter->prev is always NULL during prepare). * Stole the selftest from Kees. * Added a /proc/pid/seccomp_cache by Jann's suggestion. Patch 1 implements the test_bit against the bitmaps. Patch 2 implements the emulator that finds if a filter must return allow, Patch 3 adds the arch macros for x86. Patch 4 updates the selftest to better show the new semantics. Patch 5 implements /proc/pid/seccomp_cache. [1] https://lore.kernel.org/linux-security-module/c22a6c3cefc2412cad00ae14c1371711@huawei.com/T/ [2] https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/T/ [3] https://github.com/seccomp/libseccomp/issues/116 [4] https://github.com/moby/moby/blob/ae0ef82b90356ac613f329a8ef5ee42ca923417d/profiles/seccomp/default.json [5] https://github.com/systemd/systemd/blob/6743a1caf4037f03dc51a1277855018e4ab61957/src/shared/seccomp-util.c#L270 [6] Draco: Architectural and Operating System Support for System Call Security https://tianyin.github.io/pub/draco.pdf, MICRO-53, Oct. 2020 Kees Cook (2): x86: Enable seccomp architecture tracking selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu (3): seccomp/cache: Lookup syscall allowlist bitmap for fast path seccomp/cache: Add "emulator" to check if filter is constant allow seccomp/cache: Report cache data through /proc/pid/seccomp_cache arch/Kconfig | 24 ++ arch/x86/Kconfig | 1 + arch/x86/include/asm/seccomp.h | 15 + fs/proc/base.c | 6 + include/linux/seccomp.h | 5 + kernel/seccomp.c | 289 +++++++++++++++++- .../selftests/seccomp/seccomp_benchmark.c | 151 +++++++-- tools/testing/selftests/seccomp/settings | 2 +- 8 files changed, 469 insertions(+), 24 deletions(-) -- 2.28.0 ^ permalink raw reply [flat|nested] 149+ messages in thread
* [PATCH v4 seccomp 1/5] seccomp/cache: Lookup syscall allowlist bitmap for fast path 2020-10-09 17:14 ` [PATCH v4 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu @ 2020-10-09 17:14 ` YiFei Zhu 2020-10-09 21:30 ` Jann Horn 2020-10-09 23:18 ` Kees Cook 2020-10-09 17:14 ` [PATCH v4 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow YiFei Zhu ` (4 subsequent siblings) 5 siblings, 2 replies; 149+ messages in thread From: YiFei Zhu @ 2020-10-09 17:14 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: YiFei Zhu <yifeifz2@illinois.edu> The overhead of running Seccomp filters has been part of some past discussions [1][2][3]. Oftentimes, the filters have a large number of instructions that check syscall numbers one by one and jump based on that. Some users chain BPF filters which further enlarge the overhead. A recent work [6] comprehensively measures the Seccomp overhead and shows that the overhead is non-negligible and has a non-trivial impact on application performance. We observed some common filters, such as docker's [4] or systemd's [5], will make most decisions based only on the syscall numbers, and as past discussions considered, a bitmap where each bit represents a syscall makes most sense for these filters. The fast (common) path for seccomp should be that the filter permits the syscall to pass through, and failing seccomp is expected to be an exceptional case; it is not expected for userspace to call a denylisted syscall over and over. 
When it can be concluded that an allow must occur for the given architecture and syscall pair (this determination is introduced in the next commit), seccomp will immediately allow the syscall, bypassing further BPF execution. Each architecture number has its own bitmap. The architecture number in seccomp_data is checked against the defined architecture number constant before proceeding to test the bit against the bitmap with the syscall number as the index of the bit in the bitmap, and if the bit is set, seccomp returns allow. The bitmaps are all clear in this patch and will be initialized in the next commit. [1] https://lore.kernel.org/linux-security-module/c22a6c3cefc2412cad00ae14c1371711@huawei.com/T/ [2] https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/T/ [3] https://github.com/seccomp/libseccomp/issues/116 [4] https://github.com/moby/moby/blob/ae0ef82b90356ac613f329a8ef5ee42ca923417d/profiles/seccomp/default.json [5] https://github.com/systemd/systemd/blob/6743a1caf4037f03dc51a1277855018e4ab61957/src/shared/seccomp-util.c#L270 [6] Draco: Architectural and Operating System Support for System Call Security https://tianyin.github.io/pub/draco.pdf, MICRO-53, Oct. 2020 Co-developed-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu> Signed-off-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu> Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> --- kernel/seccomp.c | 72 ++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 72 insertions(+) diff --git a/kernel/seccomp.c b/kernel/seccomp.c index ae6b40cc39f4..73f6b6e9a3b0 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -143,6 +143,34 @@ struct notification { struct list_head notifications; }; +#ifdef SECCOMP_ARCH_NATIVE +/** + * struct action_cache - per-filter cache of seccomp actions per + * arch/syscall pair + * + * @allow_native: A bitmap where each bit represents whether the + * filter will always allow the syscall, for the + * native architecture. 
+ * @allow_compat: A bitmap where each bit represents whether the + * filter will always allow the syscall, for the + * compat architecture. + */ +struct action_cache { + DECLARE_BITMAP(allow_native, SECCOMP_ARCH_NATIVE_NR); +#ifdef SECCOMP_ARCH_COMPAT + DECLARE_BITMAP(allow_compat, SECCOMP_ARCH_COMPAT_NR); +#endif +}; +#else +struct action_cache { }; + +static inline bool seccomp_cache_check_allow(const struct seccomp_filter *sfilter, + const struct seccomp_data *sd) +{ + return false; +} +#endif /* SECCOMP_ARCH_NATIVE */ + /** * struct seccomp_filter - container for seccomp BPF programs * @@ -298,6 +326,47 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen) return 0; } +#ifdef SECCOMP_ARCH_NATIVE +static inline bool seccomp_cache_check_allow_bitmap(const void *bitmap, + size_t bitmap_size, + int syscall_nr) +{ + if (unlikely(syscall_nr < 0 || syscall_nr >= bitmap_size)) + return false; + syscall_nr = array_index_nospec(syscall_nr, bitmap_size); + + return test_bit(syscall_nr, bitmap); +} + +/** + * seccomp_cache_check_allow - lookup seccomp cache + * @sfilter: The seccomp filter + * @sd: The seccomp data to lookup the cache with + * + * Returns true if the seccomp_data is cached and allowed. 
+ */ +static inline bool seccomp_cache_check_allow(const struct seccomp_filter *sfilter, + const struct seccomp_data *sd) +{ + int syscall_nr = sd->nr; + const struct action_cache *cache = &sfilter->cache; + + if (likely(sd->arch == SECCOMP_ARCH_NATIVE)) + return seccomp_cache_check_allow_bitmap(cache->allow_native, + SECCOMP_ARCH_NATIVE_NR, + syscall_nr); +#ifdef SECCOMP_ARCH_COMPAT + if (likely(sd->arch == SECCOMP_ARCH_COMPAT)) + return seccomp_cache_check_allow_bitmap(cache->allow_compat, + SECCOMP_ARCH_COMPAT_NR, + syscall_nr); +#endif /* SECCOMP_ARCH_COMPAT */ + + WARN_ON_ONCE(true); + return false; +} +#endif /* SECCOMP_ARCH_NATIVE */ + /** * seccomp_run_filters - evaluates all seccomp filters against @sd * @sd: optional seccomp data to be passed to filters @@ -320,6 +389,9 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd, if (WARN_ON(f == NULL)) return SECCOMP_RET_KILL_PROCESS; + if (seccomp_cache_check_allow(f, sd)) + return SECCOMP_RET_ALLOW; + /* * All filters in the list are evaluated and the lowest BPF return * value always takes priority (ignoring the DATA). -- 2.28.0 ^ permalink raw reply related [flat|nested] 149+ messages in thread
* Re: [PATCH v4 seccomp 1/5] seccomp/cache: Lookup syscall allowlist bitmap for fast path 2020-10-09 17:14 ` [PATCH v4 seccomp 1/5] seccomp/cache: Lookup syscall allowlist bitmap for fast path YiFei Zhu @ 2020-10-09 21:30 ` Jann Horn 2020-10-09 23:18 ` Kees Cook 1 sibling, 0 replies; 149+ messages in thread From: Jann Horn @ 2020-10-09 21:30 UTC (permalink / raw) To: YiFei Zhu Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Fri, Oct 9, 2020 at 7:15 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > The overhead of running Seccomp filters has been part of some past > discussions [1][2][3]. Oftentimes, the filters have a large number > of instructions that check syscall numbers one by one and jump based > on that. Some users chain BPF filters which further enlarge the > overhead. A recent work [6] comprehensively measures the Seccomp > overhead and shows that the overhead is non-negligible and has a > non-trivial impact on application performance. > > We observed some common filters, such as docker's [4] or > systemd's [5], will make most decisions based only on the syscall > numbers, and as past discussions considered, a bitmap where each bit > represents a syscall makes most sense for these filters. > > The fast (common) path for seccomp should be that the filter permits > the syscall to pass through, and failing seccomp is expected to be > an exceptional case; it is not expected for userspace to call a > denylisted syscall over and over. > > When it can be concluded that an allow must occur for the given > architecture and syscall pair (this determination is introduced in > the next commit), seccomp will immediately allow the syscall, > bypassing further BPF execution. > > Each architecture number has its own bitmap. 
The architecture > number in seccomp_data is checked against the defined architecture > number constant before proceeding to test the bit against the > bitmap with the syscall number as the index of the bit in the > bitmap, and if the bit is set, seccomp returns allow. The bitmaps > are all clear in this patch and will be initialized in the next > commit. [...] > Co-developed-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu> > Signed-off-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu> > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> Reviewed-by: Jann Horn <jannh@google.com> ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v4 seccomp 1/5] seccomp/cache: Lookup syscall allowlist bitmap for fast path 2020-10-09 17:14 ` [PATCH v4 seccomp 1/5] seccomp/cache: Lookup syscall allowlist bitmap for fast path YiFei Zhu 2020-10-09 21:30 ` Jann Horn @ 2020-10-09 23:18 ` Kees Cook 1 sibling, 0 replies; 149+ messages in thread From: Kees Cook @ 2020-10-09 23:18 UTC (permalink / raw) To: YiFei Zhu Cc: containers, YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Fri, Oct 09, 2020 at 12:14:29PM -0500, YiFei Zhu wrote: > From: YiFei Zhu <yifeifz2@illinois.edu> > > The overhead of running Seccomp filters has been part of some past > discussions [1][2][3]. Oftentimes, the filters have a large number > of instructions that check syscall numbers one by one and jump based > on that. Some users chain BPF filters which further enlarge the > overhead. A recent work [6] comprehensively measures the Seccomp > overhead and shows that the overhead is non-negligible and has a > non-trivial impact on application performance. > > We observed some common filters, such as docker's [4] or > systemd's [5], will make most decisions based only on the syscall > numbers, and as past discussions considered, a bitmap where each bit > represents a syscall makes most sense for these filters. > > The fast (common) path for seccomp should be that the filter permits > the syscall to pass through, and failing seccomp is expected to be > an exceptional case; it is not expected for userspace to call a > denylisted syscall over and over. > > When it can be concluded that an allow must occur for the given > architecture and syscall pair (this determination is introduced in > the next commit), seccomp will immediately allow the syscall, > bypassing further BPF execution. 
> > Each architecture number has its own bitmap. The architecture > number in seccomp_data is checked against the defined architecture > number constant before proceeding to test the bit against the > bitmap with the syscall number as the index of the bit in the > bitmap, and if the bit is set, seccomp returns allow. The bitmaps > are all clear in this patch and will be initialized in the next > commit. > > [1] https://lore.kernel.org/linux-security-module/c22a6c3cefc2412cad00ae14c1371711@huawei.com/T/ > [2] https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/T/ > [3] https://github.com/seccomp/libseccomp/issues/116 > [4] https://github.com/moby/moby/blob/ae0ef82b90356ac613f329a8ef5ee42ca923417d/profiles/seccomp/default.json > [5] https://github.com/systemd/systemd/blob/6743a1caf4037f03dc51a1277855018e4ab61957/src/shared/seccomp-util.c#L270 > [6] Draco: Architectural and Operating System Support for System Call Security > https://tianyin.github.io/pub/draco.pdf, MICRO-53, Oct. 2020 > > Co-developed-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu> > Signed-off-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu> > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> > --- > kernel/seccomp.c | 72 ++++++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 72 insertions(+) > > diff --git a/kernel/seccomp.c b/kernel/seccomp.c > index ae6b40cc39f4..73f6b6e9a3b0 100644 > --- a/kernel/seccomp.c > +++ b/kernel/seccomp.c > @@ -143,6 +143,34 @@ struct notification { > struct list_head notifications; > }; > > +#ifdef SECCOMP_ARCH_NATIVE > +/** > + * struct action_cache - per-filter cache of seccomp actions per > + * arch/syscall pair > + * > + * @allow_native: A bitmap where each bit represents whether the > + * filter will always allow the syscall, for the > + * native architecture. > + * @allow_compat: A bitmap where each bit represents whether the > + * filter will always allow the syscall, for the > + * compat architecture. 
> + */ > +struct action_cache { > + DECLARE_BITMAP(allow_native, SECCOMP_ARCH_NATIVE_NR); > +#ifdef SECCOMP_ARCH_COMPAT > + DECLARE_BITMAP(allow_compat, SECCOMP_ARCH_COMPAT_NR); > +#endif > +}; > +#else > +struct action_cache { }; > + > +static inline bool seccomp_cache_check_allow(const struct seccomp_filter *sfilter, > + const struct seccomp_data *sd) > +{ > + return false; > +} > +#endif /* SECCOMP_ARCH_NATIVE */ > + > /** > * struct seccomp_filter - container for seccomp BPF programs > * > @@ -298,6 +326,47 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen) > return 0; > } > > +#ifdef SECCOMP_ARCH_NATIVE > +static inline bool seccomp_cache_check_allow_bitmap(const void *bitmap, > + size_t bitmap_size, > + int syscall_nr) > +{ > + if (unlikely(syscall_nr < 0 || syscall_nr >= bitmap_size)) > + return false; > + syscall_nr = array_index_nospec(syscall_nr, bitmap_size); > + > + return test_bit(syscall_nr, bitmap); > +} > + > +/** > + * seccomp_cache_check_allow - lookup seccomp cache > + * @sfilter: The seccomp filter > + * @sd: The seccomp data to lookup the cache with > + * > + * Returns true if the seccomp_data is cached and allowed. 
> + */ > +static inline bool seccomp_cache_check_allow(const struct seccomp_filter *sfilter, > + const struct seccomp_data *sd) > +{ > + int syscall_nr = sd->nr; > + const struct action_cache *cache = &sfilter->cache; > + > + if (likely(sd->arch == SECCOMP_ARCH_NATIVE)) > + return seccomp_cache_check_allow_bitmap(cache->allow_native, > + SECCOMP_ARCH_NATIVE_NR, > + syscall_nr); > +#ifdef SECCOMP_ARCH_COMPAT > + if (likely(sd->arch == SECCOMP_ARCH_COMPAT)) > + return seccomp_cache_check_allow_bitmap(cache->allow_compat, > + SECCOMP_ARCH_COMPAT_NR, > + syscall_nr); > +#endif /* SECCOMP_ARCH_COMPAT */ > + > + WARN_ON_ONCE(true); > + return false; > +} > +#endif /* SECCOMP_ARCH_NATIVE */ An small optimization for the non-compat case might be to do this to avoid the sd->arch test (which should have no way to ever change in such builds): static inline bool seccomp_cache_check_allow(const struct seccomp_filter *sfilter, const struct seccomp_data *sd) { const struct action_cache *cache = &sfilter->cache; #ifndef SECCOMP_ARCH_COMPAT /* A native-only architecture doesn't need to check sd->arch. 
*/ return seccomp_cache_check_allow_bitmap(cache->allow_native, SECCOMP_ARCH_NATIVE_NR, sd->nr); #else /* SECCOMP_ARCH_COMPAT */ if (likely(sd->arch == SECCOMP_ARCH_NATIVE)) return seccomp_cache_check_allow_bitmap(cache->allow_native, SECCOMP_ARCH_NATIVE_NR, sd->nr); if (likely(sd->arch == SECCOMP_ARCH_COMPAT)) return seccomp_cache_check_allow_bitmap(cache->allow_compat, SECCOMP_ARCH_COMPAT_NR, sd->nr); #endif WARN_ON_ONCE(true); return false; } > + > /** > * seccomp_run_filters - evaluates all seccomp filters against @sd > * @sd: optional seccomp data to be passed to filters > @@ -320,6 +389,9 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd, > if (WARN_ON(f == NULL)) > return SECCOMP_RET_KILL_PROCESS; > > + if (seccomp_cache_check_allow(f, sd)) > + return SECCOMP_RET_ALLOW; > + > /* > * All filters in the list are evaluated and the lowest BPF return > * value always takes priority (ignoring the DATA). > -- > 2.28.0 > This is all looking good; thank you! I'm doing some test builds/runs now. :) -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
* [PATCH v4 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow 2020-10-09 17:14 ` [PATCH v4 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu 2020-10-09 17:14 ` [PATCH v4 seccomp 1/5] seccomp/cache: Lookup syscall allowlist bitmap for fast path YiFei Zhu @ 2020-10-09 17:14 ` YiFei Zhu 2020-10-09 21:30 ` Jann Horn 2020-10-09 17:14 ` [PATCH v4 seccomp 3/5] x86: Enable seccomp architecture tracking YiFei Zhu ` (3 subsequent siblings) 5 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-10-09 17:14 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: YiFei Zhu <yifeifz2@illinois.edu> SECCOMP_CACHE will only operate on syscalls that do not access any syscall arguments or instruction pointer. To facilitate this we need a static analyser to know whether a filter will return allow regardless of syscall arguments for a given architecture number / syscall number pair. This is implemented here with a pseudo-emulator, and stored in a per-filter bitmap. In order to build this bitmap at filter attach time, each filter is emulated for every syscall (under each possible architecture), and checked for any accesses of struct seccomp_data that are not the "arch" nor "nr" (syscall) members. If only "arch" and "nr" are examined, and the program returns allow, then we can be sure that the filter must return allow independent from syscall arguments. Nearly all seccomp filters are built from these cBPF instructions: BPF_LD | BPF_W | BPF_ABS BPF_JMP | BPF_JEQ | BPF_K BPF_JMP | BPF_JGE | BPF_K BPF_JMP | BPF_JGT | BPF_K BPF_JMP | BPF_JSET | BPF_K BPF_JMP | BPF_JA BPF_RET | BPF_K BPF_ALU | BPF_AND | BPF_K Each of these instructions are emulated. 
Any weirdness or loading from a syscall argument will cause the emulator to bail. The emulation is also halted if it reaches a return. In that case, if it returns an SECCOMP_RET_ALLOW, the syscall is marked as good. Emulator structure and comments are from Kees [1] and Jann [2]. Emulation is done at attach time. If a filter depends on more filters, and if the dependee does not guarantee to allow the syscall, then we skip the emulation of this syscall. [1] https://lore.kernel.org/lkml/20200923232923.3142503-5-keescook@chromium.org/ [2] https://lore.kernel.org/lkml/CAG48ez1p=dR_2ikKq=xVxkoGg0fYpTBpkhJSv1w-6BG=76PAvw@mail.gmail.com/ Suggested-by: Jann Horn <jannh@google.com> Co-developed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> --- kernel/seccomp.c | 158 ++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 157 insertions(+), 1 deletion(-) diff --git a/kernel/seccomp.c b/kernel/seccomp.c index 73f6b6e9a3b0..51032b41fe59 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -169,6 +169,10 @@ static inline bool seccomp_cache_check_allow(const struct seccomp_filter *sfilte { return false; } + +static inline void seccomp_cache_prepare(struct seccomp_filter *sfilter) +{ +} #endif /* SECCOMP_ARCH_NATIVE */ /** @@ -187,6 +191,7 @@ static inline bool seccomp_cache_check_allow(const struct seccomp_filter *sfilte * this filter after reaching 0. The @users count is always smaller * or equal to @refs. Hence, reaching 0 for @users does not mean * the filter can be freed. 
+ * @cache: cache of arch/syscall mappings to actions * @log: true if all actions except for SECCOMP_RET_ALLOW should be logged * @prev: points to a previously installed, or inherited, filter * @prog: the BPF program to evaluate @@ -208,6 +213,7 @@ struct seccomp_filter { refcount_t refs; refcount_t users; bool log; + struct action_cache cache; struct seccomp_filter *prev; struct bpf_prog *prog; struct notification *notif; @@ -616,7 +622,12 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog) { struct seccomp_filter *sfilter; int ret; - const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE); + const bool save_orig = +#if defined(CONFIG_CHECKPOINT_RESTORE) || defined(SECCOMP_ARCH_NATIVE) + true; +#else + false; +#endif if (fprog->len == 0 || fprog->len > BPF_MAXINSNS) return ERR_PTR(-EINVAL); @@ -682,6 +693,150 @@ seccomp_prepare_user_filter(const char __user *user_filter) return filter; } +#ifdef SECCOMP_ARCH_NATIVE +/** + * seccomp_is_const_allow - check if filter is constant allow with given data + * @fprog: The BPF programs + * @sd: The seccomp data to check against, only syscall number are arch + * number are considered constant. 
+ */ +static bool seccomp_is_const_allow(struct sock_fprog_kern *fprog, + struct seccomp_data *sd) +{ + unsigned int insns; + unsigned int reg_value = 0; + unsigned int pc; + bool op_res; + + if (WARN_ON_ONCE(!fprog)) + return false; + + insns = bpf_classic_proglen(fprog); + for (pc = 0; pc < insns; pc++) { + struct sock_filter *insn = &fprog->filter[pc]; + u16 code = insn->code; + u32 k = insn->k; + + switch (code) { + case BPF_LD | BPF_W | BPF_ABS: + switch (k) { + case offsetof(struct seccomp_data, nr): + reg_value = sd->nr; + break; + case offsetof(struct seccomp_data, arch): + reg_value = sd->arch; + break; + default: + /* can't optimize (non-constant value load) */ + return false; + } + break; + case BPF_RET | BPF_K: + /* reached return with constant values only, check allow */ + return k == SECCOMP_RET_ALLOW; + case BPF_JMP | BPF_JA: + pc += insn->k; + break; + case BPF_JMP | BPF_JEQ | BPF_K: + case BPF_JMP | BPF_JGE | BPF_K: + case BPF_JMP | BPF_JGT | BPF_K: + case BPF_JMP | BPF_JSET | BPF_K: + switch (BPF_OP(code)) { + case BPF_JEQ: + op_res = reg_value == k; + break; + case BPF_JGE: + op_res = reg_value >= k; + break; + case BPF_JGT: + op_res = reg_value > k; + break; + case BPF_JSET: + op_res = !!(reg_value & k); + break; + default: + /* can't optimize (unknown jump) */ + return false; + } + + pc += op_res ? insn->jt : insn->jf; + break; + case BPF_ALU | BPF_AND | BPF_K: + reg_value &= k; + break; + default: + /* can't optimize (unknown insn) */ + return false; + } + } + + /* ran off the end of the filter?! */ + WARN_ON(1); + return false; +} + +static void seccomp_cache_prepare_bitmap(struct seccomp_filter *sfilter, + void *bitmap, const void *bitmap_prev, + size_t bitmap_size, int arch) +{ + struct sock_fprog_kern *fprog = sfilter->prog->orig_prog; + struct seccomp_data sd; + int nr; + + if (bitmap_prev) { + /* The new filter must be as restrictive as the last. 
*/ + bitmap_copy(bitmap, bitmap_prev, bitmap_size); + } else { + /* Before any filters, all syscalls are always allowed. */ + bitmap_fill(bitmap, bitmap_size); + } + + for (nr = 0; nr < bitmap_size; nr++) { + /* No bitmap change: not a cacheable action. */ + if (!test_bit(nr, bitmap)) + continue; + + sd.nr = nr; + sd.arch = arch; + + /* No bitmap change: continue to always allow. */ + if (seccomp_is_const_allow(fprog, &sd)) + continue; + + /* + * Not a cacheable action: always run filters. + * atomic clear_bit() not needed, filter not visible yet. + */ + __clear_bit(nr, bitmap); + } +} + +/** + * seccomp_cache_prepare - emulate the filter to find cachable syscalls + * @sfilter: The seccomp filter + * + * Returns 0 if successful or -errno if error occurred. + */ +static void seccomp_cache_prepare(struct seccomp_filter *sfilter) +{ + struct action_cache *cache = &sfilter->cache; + const struct action_cache *cache_prev = + sfilter->prev ? &sfilter->prev->cache : NULL; + + seccomp_cache_prepare_bitmap(sfilter, cache->allow_native, + cache_prev ? cache_prev->allow_native : NULL, + SECCOMP_ARCH_NATIVE_NR, + SECCOMP_ARCH_NATIVE); + +#ifdef SECCOMP_ARCH_COMPAT + seccomp_cache_prepare_bitmap(sfilter, cache->allow_compat, + cache_prev ? cache_prev->allow_compat : NULL, + SECCOMP_ARCH_COMPAT_NR, + SECCOMP_ARCH_COMPAT); +#endif /* SECCOMP_ARCH_COMPAT */ +} +#endif /* SECCOMP_ARCH_NATIVE */ + /** * seccomp_attach_filter: validate and attach filter * @flags: flags to change filter behavior @@ -731,6 +886,7 @@ static long seccomp_attach_filter(unsigned int flags, * task reference. */ filter->prev = current->seccomp.filter; + seccomp_cache_prepare(filter); current->seccomp.filter = filter; atomic_inc(¤t->seccomp.filter_count); -- 2.28.0 ^ permalink raw reply related [flat|nested] 149+ messages in thread
* Re: [PATCH v4 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow 2020-10-09 17:14 ` [PATCH v4 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow YiFei Zhu @ 2020-10-09 21:30 ` Jann Horn 2020-10-09 22:47 ` Kees Cook 0 siblings, 1 reply; 149+ messages in thread From: Jann Horn @ 2020-10-09 21:30 UTC (permalink / raw) To: YiFei Zhu Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Fri, Oct 9, 2020 at 7:15 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > > From: YiFei Zhu <yifeifz2@illinois.edu> > > SECCOMP_CACHE will only operate on syscalls that do not access > any syscall arguments or instruction pointer. To facilitate > this we need a static analyser to know whether a filter will > return allow regardless of syscall arguments for a given > architecture number / syscall number pair. This is implemented > here with a pseudo-emulator, and stored in a per-filter bitmap. > > In order to build this bitmap at filter attach time, each filter is > emulated for every syscall (under each possible architecture), and > checked for any accesses of struct seccomp_data that are not the "arch" > nor "nr" (syscall) members. If only "arch" and "nr" are examined, and > the program returns allow, then we can be sure that the filter must > return allow independent from syscall arguments. > > Nearly all seccomp filters are built from these cBPF instructions: > > BPF_LD | BPF_W | BPF_ABS > BPF_JMP | BPF_JEQ | BPF_K > BPF_JMP | BPF_JGE | BPF_K > BPF_JMP | BPF_JGT | BPF_K > BPF_JMP | BPF_JSET | BPF_K > BPF_JMP | BPF_JA > BPF_RET | BPF_K > BPF_ALU | BPF_AND | BPF_K > > Each of these instructions are emulated. Any weirdness or loading > from a syscall argument will cause the emulator to bail. 
> > The emulation is also halted if it reaches a return. In that case, > if it returns an SECCOMP_RET_ALLOW, the syscall is marked as good. > > Emulator structure and comments are from Kees [1] and Jann [2]. > > Emulation is done at attach time. If a filter depends on more > filters, and if the dependee does not guarantee to allow the > syscall, then we skip the emulation of this syscall. > > [1] https://lore.kernel.org/lkml/20200923232923.3142503-5-keescook@chromium.org/ > [2] https://lore.kernel.org/lkml/CAG48ez1p=dR_2ikKq=xVxkoGg0fYpTBpkhJSv1w-6BG=76PAvw@mail.gmail.com/ [...] > @@ -682,6 +693,150 @@ seccomp_prepare_user_filter(const char __user *user_filter) > return filter; > } > > +#ifdef SECCOMP_ARCH_NATIVE > +/** > + * seccomp_is_const_allow - check if filter is constant allow with given data > + * @fprog: The BPF programs > + * @sd: The seccomp data to check against, only syscall number are arch > + * number are considered constant. nit: s/syscall number are arch number/syscall number and arch number/ > + */ > +static bool seccomp_is_const_allow(struct sock_fprog_kern *fprog, > + struct seccomp_data *sd) > +{ > + unsigned int insns; > + unsigned int reg_value = 0; > + unsigned int pc; > + bool op_res; > + > + if (WARN_ON_ONCE(!fprog)) > + return false; > + > + insns = bpf_classic_proglen(fprog); bpf_classic_proglen() is defined as: #define bpf_classic_proglen(fprog) (fprog->len * sizeof(fprog->filter[0])) so this is wrong - what you want is the number of instructions in the program, what you actually have is the size of the program in bytes. Please instead check for `pc < fprog->len` in the loop condition. > + for (pc = 0; pc < insns; pc++) { > + struct sock_filter *insn = &fprog->filter[pc]; > + u16 code = insn->code; > + u32 k = insn->k; [...] > + } > + > + /* ran off the end of the filter?! */ > + WARN_ON(1); > + return false; > +} ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v4 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow 2020-10-09 21:30 ` Jann Horn @ 2020-10-09 22:47 ` Kees Cook 0 siblings, 0 replies; 149+ messages in thread From: Kees Cook @ 2020-10-09 22:47 UTC (permalink / raw) To: Jann Horn Cc: YiFei Zhu, Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Fri, Oct 09, 2020 at 11:30:18PM +0200, Jann Horn wrote: > On Fri, Oct 9, 2020 at 7:15 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > > > > From: YiFei Zhu <yifeifz2@illinois.edu> > > > > SECCOMP_CACHE will only operate on syscalls that do not access > > any syscall arguments or instruction pointer. To facilitate > > this we need a static analyser to know whether a filter will > > return allow regardless of syscall arguments for a given > > architecture number / syscall number pair. This is implemented > > here with a pseudo-emulator, and stored in a per-filter bitmap. > > > > In order to build this bitmap at filter attach time, each filter is > > emulated for every syscall (under each possible architecture), and > > checked for any accesses of struct seccomp_data that are not the "arch" > > nor "nr" (syscall) members. If only "arch" and "nr" are examined, and > > the program returns allow, then we can be sure that the filter must > > return allow independent from syscall arguments. > > > > Nearly all seccomp filters are built from these cBPF instructions: > > > > BPF_LD | BPF_W | BPF_ABS > > BPF_JMP | BPF_JEQ | BPF_K > > BPF_JMP | BPF_JGE | BPF_K > > BPF_JMP | BPF_JGT | BPF_K > > BPF_JMP | BPF_JSET | BPF_K > > BPF_JMP | BPF_JA > > BPF_RET | BPF_K > > BPF_ALU | BPF_AND | BPF_K > > > > Each of these instructions are emulated. 
Any weirdness or loading > > from a syscall argument will cause the emulator to bail. > > > > The emulation is also halted if it reaches a return. In that case, > > if it returns an SECCOMP_RET_ALLOW, the syscall is marked as good. > > > > Emulator structure and comments are from Kees [1] and Jann [2]. > > > > Emulation is done at attach time. If a filter depends on more > > filters, and if the dependee does not guarantee to allow the > > syscall, then we skip the emulation of this syscall. > > > > [1] https://lore.kernel.org/lkml/20200923232923.3142503-5-keescook@chromium.org/ > > [2] https://lore.kernel.org/lkml/CAG48ez1p=dR_2ikKq=xVxkoGg0fYpTBpkhJSv1w-6BG=76PAvw@mail.gmail.com/ > [...] > > @@ -682,6 +693,150 @@ seccomp_prepare_user_filter(const char __user *user_filter) > > return filter; > > } > > > > +#ifdef SECCOMP_ARCH_NATIVE > > +/** > > + * seccomp_is_const_allow - check if filter is constant allow with given data > > + * @fprog: The BPF programs > > + * @sd: The seccomp data to check against, only syscall number are arch > > + * number are considered constant. > > nit: s/syscall number are arch number/syscall number and arch number/ > > > + */ > > +static bool seccomp_is_const_allow(struct sock_fprog_kern *fprog, > > + struct seccomp_data *sd) > > +{ > > + unsigned int insns; > > + unsigned int reg_value = 0; > > + unsigned int pc; > > + bool op_res; > > + > > + if (WARN_ON_ONCE(!fprog)) > > + return false; > > + > > + insns = bpf_classic_proglen(fprog); > > bpf_classic_proglen() is defined as: > > #define bpf_classic_proglen(fprog) (fprog->len * sizeof(fprog->filter[0])) > > so this is wrong - what you want is the number of instructions in the > program, what you actually have is the size of the program in bytes. > Please instead check for `pc < fprog->len` in the loop condition. Oh yes, good catch. I had this wrong in my v1. -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
* [PATCH v4 seccomp 3/5] x86: Enable seccomp architecture tracking 2020-10-09 17:14 ` [PATCH v4 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu 2020-10-09 17:14 ` [PATCH v4 seccomp 1/5] seccomp/cache: Lookup syscall allowlist bitmap for fast path YiFei Zhu 2020-10-09 17:14 ` [PATCH v4 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow YiFei Zhu @ 2020-10-09 17:14 ` YiFei Zhu 2020-10-09 17:25 ` Andy Lutomirski 2020-10-09 17:14 ` [PATCH v4 seccomp 4/5] selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu ` (2 subsequent siblings) 5 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-10-09 17:14 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: Kees Cook <keescook@chromium.org> Provide seccomp internals with the details to calculate which syscall table the running kernel is expecting to deal with. This allows for efficient architecture pinning and paves the way for constant-action bitmaps. 
Signed-off-by: Kees Cook <keescook@chromium.org> Co-developed-by: YiFei Zhu <yifeifz2@illinois.edu> Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> --- arch/x86/include/asm/seccomp.h | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/arch/x86/include/asm/seccomp.h b/arch/x86/include/asm/seccomp.h index 2bd1338de236..03365af6165d 100644 --- a/arch/x86/include/asm/seccomp.h +++ b/arch/x86/include/asm/seccomp.h @@ -16,6 +16,18 @@ #define __NR_seccomp_sigreturn_32 __NR_ia32_sigreturn #endif +#ifdef CONFIG_X86_64 +# define SECCOMP_ARCH_NATIVE AUDIT_ARCH_X86_64 +# define SECCOMP_ARCH_NATIVE_NR NR_syscalls +# ifdef CONFIG_COMPAT +# define SECCOMP_ARCH_COMPAT AUDIT_ARCH_I386 +# define SECCOMP_ARCH_COMPAT_NR IA32_NR_syscalls +# endif +#else /* !CONFIG_X86_64 */ +# define SECCOMP_ARCH_NATIVE AUDIT_ARCH_I386 +# define SECCOMP_ARCH_NATIVE_NR NR_syscalls +#endif + #include <asm-generic/seccomp.h> #endif /* _ASM_X86_SECCOMP_H */ -- 2.28.0 ^ permalink raw reply related [flat|nested] 149+ messages in thread
* Re: [PATCH v4 seccomp 3/5] x86: Enable seccomp architecture tracking 2020-10-09 17:14 ` [PATCH v4 seccomp 3/5] x86: Enable seccomp architecture tracking YiFei Zhu @ 2020-10-09 17:25 ` Andy Lutomirski 2020-10-09 18:32 ` YiFei Zhu 0 siblings, 1 reply; 149+ messages in thread From: Andy Lutomirski @ 2020-10-09 17:25 UTC (permalink / raw) To: YiFei Zhu Cc: Linux Containers, YiFei Zhu, bpf, LKML, Aleksa Sarai, Andrea Arcangeli, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Fri, Oct 9, 2020 at 10:15 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > > From: Kees Cook <keescook@chromium.org> > > Provide seccomp internals with the details to calculate which syscall > table the running kernel is expecting to deal with. This allows for > efficient architecture pinning and paves the way for constant-action > bitmaps. > > Signed-off-by: Kees Cook <keescook@chromium.org> > Co-developed-by: YiFei Zhu <yifeifz2@illinois.edu> > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> > --- > arch/x86/include/asm/seccomp.h | 12 ++++++++++++ > 1 file changed, 12 insertions(+) > > diff --git a/arch/x86/include/asm/seccomp.h b/arch/x86/include/asm/seccomp.h > index 2bd1338de236..03365af6165d 100644 > --- a/arch/x86/include/asm/seccomp.h > +++ b/arch/x86/include/asm/seccomp.h > @@ -16,6 +16,18 @@ > #define __NR_seccomp_sigreturn_32 __NR_ia32_sigreturn > #endif > > +#ifdef CONFIG_X86_64 > +# define SECCOMP_ARCH_NATIVE AUDIT_ARCH_X86_64 > +# define SECCOMP_ARCH_NATIVE_NR NR_syscalls > +# ifdef CONFIG_COMPAT > +# define SECCOMP_ARCH_COMPAT AUDIT_ARCH_I386 > +# define SECCOMP_ARCH_COMPAT_NR IA32_NR_syscalls > +# endif > +#else /* !CONFIG_X86_64 */ > +# define SECCOMP_ARCH_NATIVE AUDIT_ARCH_I386 > +# define SECCOMP_ARCH_NATIVE_NR NR_syscalls > +#endif Is the idea that any syscall that's out of range for this (e.g. 
all of the x32 syscalls) is unoptimized? I'm okay with this, but I think it could use a comment. > + > #include <asm-generic/seccomp.h> > > #endif /* _ASM_X86_SECCOMP_H */ > -- > 2.28.0 > -- Andy Lutomirski AMA Capital Management, LLC ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v4 seccomp 3/5] x86: Enable seccomp architecture tracking 2020-10-09 17:25 ` Andy Lutomirski @ 2020-10-09 18:32 ` YiFei Zhu 2020-10-09 20:59 ` Andy Lutomirski 0 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-10-09 18:32 UTC (permalink / raw) To: Andy Lutomirski Cc: Linux Containers, YiFei Zhu, bpf, LKML, Aleksa Sarai, Andrea Arcangeli, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Fri, Oct 9, 2020 at 12:25 PM Andy Lutomirski <luto@amacapital.net> wrote: > Is the idea that any syscall that's out of range for this (e.g. all of > the x32 syscalls) is unoptimized? I'm okay with this, but I think it > could use a comment. Yes, any syscall number that is out of range is unoptimized. Where do you think I should put a comment? seccomp_cache_check_allow_bitmap above `if (unlikely(syscall_nr < 0 || syscall_nr >= bitmap_size))`, with something like "any syscall number out of range is unoptimized"? YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v4 seccomp 3/5] x86: Enable seccomp architecture tracking 2020-10-09 18:32 ` YiFei Zhu @ 2020-10-09 20:59 ` Andy Lutomirski 0 siblings, 0 replies; 149+ messages in thread From: Andy Lutomirski @ 2020-10-09 20:59 UTC (permalink / raw) To: YiFei Zhu Cc: Linux Containers, YiFei Zhu, bpf, LKML, Aleksa Sarai, Andrea Arcangeli, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Fri, Oct 9, 2020 at 11:32 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > > On Fri, Oct 9, 2020 at 12:25 PM Andy Lutomirski <luto@amacapital.net> wrote: > > Is the idea that any syscall that's out of range for this (e.g. all of > > the x32 syscalls) is unoptimized? I'm okay with this, but I think it > > could use a comment. > > Yes, any syscall number that is out of range is unoptimized. Where do > you think I should put a comment? seccomp_cache_check_allow_bitmap > above `if (unlikely(syscall_nr < 0 || syscall_nr >= bitmap_size))`, > with something like "any syscall number out of range is unoptimized"? > I was imagining a comment near the new macros explaining that this is the range of syscalls that seccomp will optimize, that behavior is still correct (albeit slower) for out of range syscalls, and that x32 is intentionally not optimized. This avoids people like future me reading this code, not remembering the context, and thinking it looks buggy. ^ permalink raw reply [flat|nested] 149+ messages in thread
* [PATCH v4 seccomp 4/5] selftests/seccomp: Compare bitmap vs filter overhead 2020-10-09 17:14 ` [PATCH v4 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu ` (2 preceding siblings ...) 2020-10-09 17:14 ` [PATCH v4 seccomp 3/5] x86: Enable seccomp architecture tracking YiFei Zhu @ 2020-10-09 17:14 ` YiFei Zhu 2020-10-09 17:14 ` [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu 2020-10-11 15:47 ` [PATCH v5 seccomp 0/5]seccomp: Add bitmap cache of constant allow filter results YiFei Zhu 5 siblings, 0 replies; 149+ messages in thread From: YiFei Zhu @ 2020-10-09 17:14 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: Kees Cook <keescook@chromium.org> As part of the seccomp benchmarking, include the expectations with regard to the timing behavior of the constant action bitmaps, and report inconsistencies better. Example output with constant action bitmaps on x86: $ sudo ./seccomp_benchmark 100000000 Current BPF sysctl settings: net.core.bpf_jit_enable = 1 net.core.bpf_jit_harden = 0 Benchmarking 200000000 syscalls... 
129.359381409 - 0.008724424 = 129350656985 (129.4s) getpid native: 646 ns 264.385890006 - 129.360453229 = 135025436777 (135.0s) getpid RET_ALLOW 1 filter (bitmap): 675 ns 399.400511893 - 264.387045901 = 135013465992 (135.0s) getpid RET_ALLOW 2 filters (bitmap): 675 ns 545.872866260 - 399.401718327 = 146471147933 (146.5s) getpid RET_ALLOW 3 filters (full): 732 ns 696.337101319 - 545.874097681 = 150463003638 (150.5s) getpid RET_ALLOW 4 filters (full): 752 ns Estimated total seccomp overhead for 1 bitmapped filter: 29 ns Estimated total seccomp overhead for 2 bitmapped filters: 29 ns Estimated total seccomp overhead for 3 full filters: 86 ns Estimated total seccomp overhead for 4 full filters: 106 ns Estimated seccomp entry overhead: 29 ns Estimated seccomp per-filter overhead (last 2 diff): 20 ns Estimated seccomp per-filter overhead (filters / 4): 19 ns Expectations: native ≤ 1 bitmap (646 ≤ 675): ✔️ native ≤ 1 filter (646 ≤ 732): ✔️ per-filter (last 2 diff) ≈ per-filter (filters / 4) (20 ≈ 19): ✔️ 1 bitmapped ≈ 2 bitmapped (29 ≈ 29): ✔️ entry ≈ 1 bitmapped (29 ≈ 29): ✔️ entry ≈ 2 bitmapped (29 ≈ 29): ✔️ native + entry + (per filter * 4) ≈ 4 filters total (755 ≈ 752): ✔️ Signed-off-by: Kees Cook <keescook@chromium.org> [YiFei: Changed commit message to show stats for this patch series] Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> --- .../selftests/seccomp/seccomp_benchmark.c | 151 +++++++++++++++--- tools/testing/selftests/seccomp/settings | 2 +- 2 files changed, 130 insertions(+), 23 deletions(-) diff --git a/tools/testing/selftests/seccomp/seccomp_benchmark.c b/tools/testing/selftests/seccomp/seccomp_benchmark.c index 91f5a89cadac..fcc806585266 100644 --- a/tools/testing/selftests/seccomp/seccomp_benchmark.c +++ b/tools/testing/selftests/seccomp/seccomp_benchmark.c @@ -4,12 +4,16 @@ */ #define _GNU_SOURCE #include <assert.h> +#include <limits.h> +#include <stdbool.h> +#include <stddef.h> #include <stdio.h> #include <stdlib.h> #include <time.h> #include 
<unistd.h> #include <linux/filter.h> #include <linux/seccomp.h> +#include <sys/param.h> #include <sys/prctl.h> #include <sys/syscall.h> #include <sys/types.h> @@ -70,18 +74,74 @@ unsigned long long calibrate(void) return samples * seconds; } +bool approx(int i_one, int i_two) +{ + double one = i_one, one_bump = one * 0.01; + double two = i_two, two_bump = two * 0.01; + + one_bump = one + MAX(one_bump, 2.0); + two_bump = two + MAX(two_bump, 2.0); + + /* Equal to, or within 1% or 2 digits */ + if (one == two || + (one > two && one <= two_bump) || + (two > one && two <= one_bump)) + return true; + return false; +} + +bool le(int i_one, int i_two) +{ + if (i_one <= i_two) + return true; + return false; +} + +long compare(const char *name_one, const char *name_eval, const char *name_two, + unsigned long long one, bool (*eval)(int, int), unsigned long long two) +{ + bool good; + + printf("\t%s %s %s (%lld %s %lld): ", name_one, name_eval, name_two, + (long long)one, name_eval, (long long)two); + if (one > INT_MAX) { + printf("Miscalculation! Measurement went negative: %lld\n", (long long)one); + return 1; + } + if (two > INT_MAX) { + printf("Miscalculation! Measurement went negative: %lld\n", (long long)two); + return 1; + } + + good = eval(one, two); + printf("%s\n", good ? "✔️" : "❌"); + + return good ? 
0 : 1; +} + int main(int argc, char *argv[]) { + struct sock_filter bitmap_filter[] = { + BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, nr)), + BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW), + }; + struct sock_fprog bitmap_prog = { + .len = (unsigned short)ARRAY_SIZE(bitmap_filter), + .filter = bitmap_filter, + }; struct sock_filter filter[] = { + BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, args[0])), BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW), }; struct sock_fprog prog = { .len = (unsigned short)ARRAY_SIZE(filter), .filter = filter, }; - long ret; - unsigned long long samples; - unsigned long long native, filter1, filter2; + + long ret, bits; + unsigned long long samples, calc; + unsigned long long native, filter1, filter2, bitmap1, bitmap2; + unsigned long long entry, per_filter1, per_filter2; printf("Current BPF sysctl settings:\n"); system("sysctl net.core.bpf_jit_enable"); @@ -101,35 +161,82 @@ int main(int argc, char *argv[]) ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0); assert(ret == 0); - /* One filter */ - ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog); + /* One filter resulting in a bitmap */ + ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bitmap_prog); assert(ret == 0); - filter1 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples; - printf("getpid RET_ALLOW 1 filter: %llu ns\n", filter1); + bitmap1 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples; + printf("getpid RET_ALLOW 1 filter (bitmap): %llu ns\n", bitmap1); + + /* Second filter resulting in a bitmap */ + ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bitmap_prog); + assert(ret == 0); - if (filter1 == native) - printf("No overhead measured!? 
Try running again with more samples.\n"); + bitmap2 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples; + printf("getpid RET_ALLOW 2 filters (bitmap): %llu ns\n", bitmap2); - /* Two filters */ + /* Third filter, can no longer be converted to bitmap */ ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog); assert(ret == 0); - filter2 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples; - printf("getpid RET_ALLOW 2 filters: %llu ns\n", filter2); - - /* Calculations */ - printf("Estimated total seccomp overhead for 1 filter: %llu ns\n", - filter1 - native); + filter1 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples; + printf("getpid RET_ALLOW 3 filters (full): %llu ns\n", filter1); - printf("Estimated total seccomp overhead for 2 filters: %llu ns\n", - filter2 - native); + /* Fourth filter, can not be converted to bitmap because of filter 3 */ + ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bitmap_prog); + assert(ret == 0); - printf("Estimated seccomp per-filter overhead: %llu ns\n", - filter2 - filter1); + filter2 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples; + printf("getpid RET_ALLOW 4 filters (full): %llu ns\n", filter2); + + /* Estimations */ +#define ESTIMATE(fmt, var, what) do { \ + var = (what); \ + printf("Estimated " fmt ": %llu ns\n", var); \ + if (var > INT_MAX) \ + goto more_samples; \ + } while (0) + + ESTIMATE("total seccomp overhead for 1 bitmapped filter", calc, + bitmap1 - native); + ESTIMATE("total seccomp overhead for 2 bitmapped filters", calc, + bitmap2 - native); + ESTIMATE("total seccomp overhead for 3 full filters", calc, + filter1 - native); + ESTIMATE("total seccomp overhead for 4 full filters", calc, + filter2 - native); + ESTIMATE("seccomp entry overhead", entry, + bitmap1 - native - (bitmap2 - bitmap1)); + ESTIMATE("seccomp per-filter overhead (last 2 diff)", per_filter1, + filter2 - filter1); + ESTIMATE("seccomp per-filter overhead (filters / 4)", per_filter2, + (filter2 - native - entry) / 4); + + 
printf("Expectations:\n"); + ret |= compare("native", "≤", "1 bitmap", native, le, bitmap1); + bits = compare("native", "≤", "1 filter", native, le, filter1); + if (bits) + goto more_samples; + + ret |= compare("per-filter (last 2 diff)", "≈", "per-filter (filters / 4)", + per_filter1, approx, per_filter2); + + bits = compare("1 bitmapped", "≈", "2 bitmapped", + bitmap1 - native, approx, bitmap2 - native); + if (bits) { + printf("Skipping constant action bitmap expectations: they appear unsupported.\n"); + goto out; + } - printf("Estimated seccomp entry overhead: %llu ns\n", - filter1 - native - (filter2 - filter1)); + ret |= compare("entry", "≈", "1 bitmapped", entry, approx, bitmap1 - native); + ret |= compare("entry", "≈", "2 bitmapped", entry, approx, bitmap2 - native); + ret |= compare("native + entry + (per filter * 4)", "≈", "4 filters total", + entry + (per_filter1 * 4) + native, approx, filter2); + if (ret == 0) + goto out; +more_samples: + printf("Saw unexpected benchmark result. Try running again with more samples?\n"); +out: return 0; } diff --git a/tools/testing/selftests/seccomp/settings b/tools/testing/selftests/seccomp/settings index ba4d85f74cd6..6091b45d226b 100644 --- a/tools/testing/selftests/seccomp/settings +++ b/tools/testing/selftests/seccomp/settings @@ -1 +1 @@ -timeout=90 +timeout=120 -- 2.28.0 ^ permalink raw reply related [flat|nested] 149+ messages in thread
* [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-10-09 17:14 ` [PATCH v4 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu ` (3 preceding siblings ...) 2020-10-09 17:14 ` [PATCH v4 seccomp 4/5] selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu @ 2020-10-09 17:14 ` YiFei Zhu 2020-10-09 21:45 ` Jann Horn 2020-10-09 23:14 ` Kees Cook 2020-10-11 15:47 ` [PATCH v5 seccomp 0/5]seccomp: Add bitmap cache of constant allow filter results YiFei Zhu 5 siblings, 2 replies; 149+ messages in thread From: YiFei Zhu @ 2020-10-09 17:14 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: YiFei Zhu <yifeifz2@illinois.edu> Currently the kernel does not provide an infrastructure to translate architecture numbers to a human-readable name. Translating syscall numbers to syscall names is possible through FTRACE_SYSCALL infrastructure but it does not provide support for compat syscalls. This will create a file for each PID as /proc/pid/seccomp_cache. The file will be empty when no seccomp filters are loaded, or be in the format of: <arch name> <decimal syscall number> <ALLOW | FILTER> where ALLOW means the cache is guaranteed to allow the syscall, and filter means the cache will pass the syscall to the BPF filter. For the docker default profile on x86_64 it looks like: x86_64 0 ALLOW x86_64 1 ALLOW x86_64 2 ALLOW x86_64 3 ALLOW [...] x86_64 132 ALLOW x86_64 133 ALLOW x86_64 134 FILTER x86_64 135 FILTER x86_64 136 FILTER x86_64 137 ALLOW x86_64 138 ALLOW x86_64 139 FILTER x86_64 140 ALLOW x86_64 141 ALLOW [...] 
This file is guarded by CONFIG_SECCOMP_CACHE_DEBUG with a default of N because I think certain users of seccomp might not want the application to know which syscalls are definitely usable. For the same reason, it is also guarded by CAP_SYS_ADMIN. Suggested-by: Jann Horn <jannh@google.com> Link: https://lore.kernel.org/lkml/CAG48ez3Ofqp4crXGksLmZY6=fGrF_tWyUCg7PBkAetvbbOPeOA@mail.gmail.com/ Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> --- arch/Kconfig | 24 ++++++++++++++ arch/x86/Kconfig | 1 + arch/x86/include/asm/seccomp.h | 3 ++ fs/proc/base.c | 6 ++++ include/linux/seccomp.h | 5 +++ kernel/seccomp.c | 59 ++++++++++++++++++++++++++++++++++ 6 files changed, 98 insertions(+) diff --git a/arch/Kconfig b/arch/Kconfig index 21a3675a7a3a..85239a974f04 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -471,6 +471,15 @@ config HAVE_ARCH_SECCOMP_FILTER results in the system call being skipped immediately. - seccomp syscall wired up +config HAVE_ARCH_SECCOMP_CACHE + bool + help + An arch should select this symbol if it provides all of these things: + - all the requirements for HAVE_ARCH_SECCOMP_FILTER + - SECCOMP_ARCH_NATIVE + - SECCOMP_ARCH_NATIVE_NR + - SECCOMP_ARCH_NATIVE_NAME + config SECCOMP prompt "Enable seccomp to safely execute untrusted bytecode" def_bool y @@ -498,6 +507,21 @@ config SECCOMP_FILTER See Documentation/userspace-api/seccomp_filter.rst for details. +config SECCOMP_CACHE_DEBUG + bool "Show seccomp filter cache status in /proc/pid/seccomp_cache" + depends on SECCOMP + depends on SECCOMP_FILTER + depends on PROC_FS + help + This is enables /proc/pid/seccomp_cache interface to monitor + seccomp cache data. The file format is subject to change. Reading + the file requires CAP_SYS_ADMIN. + + This option is for debugging only. Enabling present the risk that + an adversary may be able to infer the seccomp filter logic. + + If unsure, say N. 
+ config HAVE_ARCH_STACKLEAK bool help diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 1ab22869a765..1a807f89ac77 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -150,6 +150,7 @@ config X86 select HAVE_ARCH_COMPAT_MMAP_BASES if MMU && COMPAT select HAVE_ARCH_PREL32_RELOCATIONS select HAVE_ARCH_SECCOMP_FILTER + select HAVE_ARCH_SECCOMP_CACHE select HAVE_ARCH_THREAD_STRUCT_WHITELIST select HAVE_ARCH_STACKLEAK select HAVE_ARCH_TRACEHOOK diff --git a/arch/x86/include/asm/seccomp.h b/arch/x86/include/asm/seccomp.h index 03365af6165d..cd57c3eabab5 100644 --- a/arch/x86/include/asm/seccomp.h +++ b/arch/x86/include/asm/seccomp.h @@ -19,13 +19,16 @@ #ifdef CONFIG_X86_64 # define SECCOMP_ARCH_NATIVE AUDIT_ARCH_X86_64 # define SECCOMP_ARCH_NATIVE_NR NR_syscalls +# define SECCOMP_ARCH_NATIVE_NAME "x86_64" # ifdef CONFIG_COMPAT # define SECCOMP_ARCH_COMPAT AUDIT_ARCH_I386 # define SECCOMP_ARCH_COMPAT_NR IA32_NR_syscalls +# define SECCOMP_ARCH_COMPAT_NAME "ia32" # endif #else /* !CONFIG_X86_64 */ # define SECCOMP_ARCH_NATIVE AUDIT_ARCH_I386 # define SECCOMP_ARCH_NATIVE_NR NR_syscalls +# define SECCOMP_ARCH_NATIVE_NAME "ia32" #endif #include <asm-generic/seccomp.h> diff --git a/fs/proc/base.c b/fs/proc/base.c index 617db4e0faa0..a4990410ff05 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -3258,6 +3258,9 @@ static const struct pid_entry tgid_base_stuff[] = { #ifdef CONFIG_PROC_PID_ARCH_STATUS ONE("arch_status", S_IRUGO, proc_pid_arch_status), #endif +#ifdef CONFIG_SECCOMP_CACHE_DEBUG + ONE("seccomp_cache", S_IRUSR, proc_pid_seccomp_cache), +#endif }; static int proc_tgid_base_readdir(struct file *file, struct dir_context *ctx) @@ -3587,6 +3590,9 @@ static const struct pid_entry tid_base_stuff[] = { #ifdef CONFIG_PROC_PID_ARCH_STATUS ONE("arch_status", S_IRUGO, proc_pid_arch_status), #endif +#ifdef CONFIG_SECCOMP_CACHE_DEBUG + ONE("seccomp_cache", S_IRUSR, proc_pid_seccomp_cache), +#endif }; static int proc_tid_base_readdir(struct file *file, struct dir_context 
*ctx) diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h index 02aef2844c38..1f028d55142a 100644 --- a/include/linux/seccomp.h +++ b/include/linux/seccomp.h @@ -121,4 +121,9 @@ static inline long seccomp_get_metadata(struct task_struct *task, return -EINVAL; } #endif /* CONFIG_SECCOMP_FILTER && CONFIG_CHECKPOINT_RESTORE */ + +#ifdef CONFIG_SECCOMP_CACHE_DEBUG +int proc_pid_seccomp_cache(struct seq_file *m, struct pid_namespace *ns, + struct pid *pid, struct task_struct *task); +#endif #endif /* _LINUX_SECCOMP_H */ diff --git a/kernel/seccomp.c b/kernel/seccomp.c index 51032b41fe59..a75746d259a5 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -548,6 +548,9 @@ void seccomp_filter_release(struct task_struct *tsk) { struct seccomp_filter *orig = tsk->seccomp.filter; + /* We are effectively holding the siglock by not having any sighand. */ + WARN_ON(tsk->sighand != NULL); + /* Detach task from its filter tree. */ tsk->seccomp.filter = NULL; __seccomp_filter_release(orig); @@ -2308,3 +2311,59 @@ static int __init seccomp_sysctl_init(void) device_initcall(seccomp_sysctl_init) #endif /* CONFIG_SYSCTL */ + +#ifdef CONFIG_SECCOMP_CACHE_DEBUG +/* Currently CONFIG_SECCOMP_CACHE_DEBUG implies SECCOMP_ARCH_NATIVE */ +static void proc_pid_seccomp_cache_arch(struct seq_file *m, const char *name, + const void *bitmap, size_t bitmap_size) +{ + int nr; + + for (nr = 0; nr < bitmap_size; nr++) { + bool cached = test_bit(nr, bitmap); + char *status = cached ? "ALLOW" : "FILTER"; + + seq_printf(m, "%s %d %s\n", name, nr, status); + } +} + +int proc_pid_seccomp_cache(struct seq_file *m, struct pid_namespace *ns, + struct pid *pid, struct task_struct *task) +{ + struct seccomp_filter *f; + unsigned long flags; + + /* + * We don't want some sandboxed process know what their seccomp + * filters consist of. 
+ */ + if (!file_ns_capable(m->file, &init_user_ns, CAP_SYS_ADMIN)) + return -EACCES; + + if (!lock_task_sighand(task, &flags)) + return 0; + + f = READ_ONCE(task->seccomp.filter); + if (!f) { + unlock_task_sighand(task, &flags); + return 0; + } + + /* prevent filter from being freed while we are printing it */ + __get_seccomp_filter(f); + unlock_task_sighand(task, &flags); + + proc_pid_seccomp_cache_arch(m, SECCOMP_ARCH_NATIVE_NAME, + f->cache.allow_native, + SECCOMP_ARCH_NATIVE_NR); + +#ifdef SECCOMP_ARCH_COMPAT + proc_pid_seccomp_cache_arch(m, SECCOMP_ARCH_COMPAT_NAME, + f->cache.allow_compat, + SECCOMP_ARCH_COMPAT_NR); +#endif /* SECCOMP_ARCH_COMPAT */ + + __put_seccomp_filter(f); + return 0; +} +#endif /* CONFIG_SECCOMP_CACHE_DEBUG */ -- 2.28.0 ^ permalink raw reply related [flat|nested] 149+ messages in thread
* Re: [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-10-09 17:14 ` [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu @ 2020-10-09 21:45 ` Jann Horn 2020-10-09 23:14 ` Kees Cook 1 sibling, 0 replies; 149+ messages in thread From: Jann Horn @ 2020-10-09 21:45 UTC (permalink / raw) To: YiFei Zhu Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Fri, Oct 9, 2020 at 7:15 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > Currently the kernel does not provide an infrastructure to translate > architecture numbers to a human-readable name. Translating syscall > numbers to syscall names is possible through FTRACE_SYSCALL > infrastructure but it does not provide support for compat syscalls. > > This will create a file for each PID as /proc/pid/seccomp_cache. > The file will be empty when no seccomp filters are loaded, or be > in the format of: > <arch name> <decimal syscall number> <ALLOW | FILTER> > where ALLOW means the cache is guaranteed to allow the syscall, > and filter means the cache will pass the syscall to the BPF filter. > > For the docker default profile on x86_64 it looks like: > x86_64 0 ALLOW > x86_64 1 ALLOW > x86_64 2 ALLOW > x86_64 3 ALLOW > [...] > x86_64 132 ALLOW > x86_64 133 ALLOW > x86_64 134 FILTER > x86_64 135 FILTER > x86_64 136 FILTER > x86_64 137 ALLOW > x86_64 138 ALLOW > x86_64 139 FILTER > x86_64 140 ALLOW > x86_64 141 ALLOW > [...] > > This file is guarded by CONFIG_SECCOMP_CACHE_DEBUG with a default > of N because I think certain users of seccomp might not want the > application to know which syscalls are definitely usable. For > the same reason, it is also guarded by CAP_SYS_ADMIN. 
> > Suggested-by: Jann Horn <jannh@google.com> > Link: https://lore.kernel.org/lkml/CAG48ez3Ofqp4crXGksLmZY6=fGrF_tWyUCg7PBkAetvbbOPeOA@mail.gmail.com/ > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> [...] > diff --git a/arch/Kconfig b/arch/Kconfig [...] > +config SECCOMP_CACHE_DEBUG > + bool "Show seccomp filter cache status in /proc/pid/seccomp_cache" > + depends on SECCOMP > + depends on SECCOMP_FILTER > + depends on PROC_FS > + help > + This is enables /proc/pid/seccomp_cache interface to monitor nit: s/This is enables/This enables the/ > + seccomp cache data. The file format is subject to change. Reading > + the file requires CAP_SYS_ADMIN. > + > + This option is for debugging only. Enabling present the risk that nit: *presents > + an adversary may be able to infer the seccomp filter logic. [...] > +int proc_pid_seccomp_cache(struct seq_file *m, struct pid_namespace *ns, > + struct pid *pid, struct task_struct *task) > +{ > + struct seccomp_filter *f; > + unsigned long flags; > + > + /* > + * We don't want some sandboxed process know what their seccomp s/know/to know/ > + * filters consist of. > + */ > + if (!file_ns_capable(m->file, &init_user_ns, CAP_SYS_ADMIN)) > + return -EACCES; > + > + if (!lock_task_sighand(task, &flags)) > + return 0; maybe return -ESRCH here so that userspace can distinguish between an exiting process and a process with no filters? > + f = READ_ONCE(task->seccomp.filter); > + if (!f) { > + unlock_task_sighand(task, &flags); > + return 0; > + } [...] ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-10-09 17:14 ` [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu 2020-10-09 21:45 ` Jann Horn @ 2020-10-09 23:14 ` Kees Cook 2020-10-10 13:26 ` YiFei Zhu 1 sibling, 1 reply; 149+ messages in thread From: Kees Cook @ 2020-10-09 23:14 UTC (permalink / raw) To: YiFei Zhu Cc: containers, YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Fri, Oct 09, 2020 at 12:14:33PM -0500, YiFei Zhu wrote: > From: YiFei Zhu <yifeifz2@illinois.edu> > > Currently the kernel does not provide an infrastructure to translate > architecture numbers to a human-readable name. Translating syscall > numbers to syscall names is possible through FTRACE_SYSCALL > infrastructure but it does not provide support for compat syscalls. > > This will create a file for each PID as /proc/pid/seccomp_cache. > The file will be empty when no seccomp filters are loaded, or be > in the format of: > <arch name> <decimal syscall number> <ALLOW | FILTER> > where ALLOW means the cache is guaranteed to allow the syscall, > and filter means the cache will pass the syscall to the BPF filter. > > For the docker default profile on x86_64 it looks like: > x86_64 0 ALLOW > x86_64 1 ALLOW > x86_64 2 ALLOW > x86_64 3 ALLOW > [...] > x86_64 132 ALLOW > x86_64 133 ALLOW > x86_64 134 FILTER > x86_64 135 FILTER > x86_64 136 FILTER > x86_64 137 ALLOW > x86_64 138 ALLOW > x86_64 139 FILTER > x86_64 140 ALLOW > x86_64 141 ALLOW > [...] > > This file is guarded by CONFIG_SECCOMP_CACHE_DEBUG with a default > of N because I think certain users of seccomp might not want the > application to know which syscalls are definitely usable. 
For > the same reason, it is also guarded by CAP_SYS_ADMIN. > > Suggested-by: Jann Horn <jannh@google.com> > Link: https://lore.kernel.org/lkml/CAG48ez3Ofqp4crXGksLmZY6=fGrF_tWyUCg7PBkAetvbbOPeOA@mail.gmail.com/ > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> > --- > arch/Kconfig | 24 ++++++++++++++ > arch/x86/Kconfig | 1 + > arch/x86/include/asm/seccomp.h | 3 ++ > fs/proc/base.c | 6 ++++ > include/linux/seccomp.h | 5 +++ > kernel/seccomp.c | 59 ++++++++++++++++++++++++++++++++++ > 6 files changed, 98 insertions(+) > > diff --git a/arch/Kconfig b/arch/Kconfig > index 21a3675a7a3a..85239a974f04 100644 > --- a/arch/Kconfig > +++ b/arch/Kconfig > @@ -471,6 +471,15 @@ config HAVE_ARCH_SECCOMP_FILTER > results in the system call being skipped immediately. > - seccomp syscall wired up > > +config HAVE_ARCH_SECCOMP_CACHE > + bool > + help > + An arch should select this symbol if it provides all of these things: > + - all the requirements for HAVE_ARCH_SECCOMP_FILTER > + - SECCOMP_ARCH_NATIVE > + - SECCOMP_ARCH_NATIVE_NR > + - SECCOMP_ARCH_NATIVE_NAME > + > [...] > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig > index 1ab22869a765..1a807f89ac77 100644 > --- a/arch/x86/Kconfig > +++ b/arch/x86/Kconfig > @@ -150,6 +150,7 @@ config X86 > select HAVE_ARCH_COMPAT_MMAP_BASES if MMU && COMPAT > select HAVE_ARCH_PREL32_RELOCATIONS > select HAVE_ARCH_SECCOMP_FILTER > + select HAVE_ARCH_SECCOMP_CACHE > select HAVE_ARCH_THREAD_STRUCT_WHITELIST > select HAVE_ARCH_STACKLEAK > select HAVE_ARCH_TRACEHOOK HAVE_ARCH_SECCOMP_CACHE isn't used any more. I think this was left over from before. -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-10-09 23:14 ` Kees Cook @ 2020-10-10 13:26 ` YiFei Zhu 2020-10-12 22:57 ` Kees Cook 0 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-10-10 13:26 UTC (permalink / raw) To: Kees Cook Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Fri, Oct 9, 2020 at 6:14 PM Kees Cook <keescook@chromium.org> wrote: > HAVE_ARCH_SECCOMP_CACHE isn't used any more. I think this was left over > from before. Oh, I was meant to add this to the dependencies of SECCOMP_CACHE_DEBUG. Is this something that would make sense? YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-10-10 13:26 ` YiFei Zhu @ 2020-10-12 22:57 ` Kees Cook 2020-10-13 0:31 ` YiFei Zhu 0 siblings, 1 reply; 149+ messages in thread From: Kees Cook @ 2020-10-12 22:57 UTC (permalink / raw) To: YiFei Zhu Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Sat, Oct 10, 2020 at 08:26:16AM -0500, YiFei Zhu wrote: > On Fri, Oct 9, 2020 at 6:14 PM Kees Cook <keescook@chromium.org> wrote: > > HAVE_ARCH_SECCOMP_CACHE isn't used any more. I think this was left over > > from before. > > Oh, I was meant to add this to the dependencies of > SECCOMP_CACHE_DEBUG. Is this something that would make sense? I think it's fine to just have this "dangle" with a help text update of "if seccomp action caching is supported by the architecture, provide the /proc/$pid ..." -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-10-12 22:57 ` Kees Cook @ 2020-10-13 0:31 ` YiFei Zhu 2020-10-22 20:52 ` YiFei Zhu 0 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-10-13 0:31 UTC (permalink / raw) To: Kees Cook Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Mon, Oct 12, 2020 at 5:57 PM Kees Cook <keescook@chromium.org> wrote: > I think it's fine to just have this "dangle" with a help text update of > "if seccomp action caching is supported by the architecture, provide the > /proc/$pid ..." I think it would be weird if someone sees this help text and wonder... "hmm does my architecture support seccomp action caching" and without a clear pointer to how seccomp action cache works, goes and compiles the kernel with this config option on for the purpose of knowing if their arch supports it... Or, is it a common practice in the kernel to leave dangling configs? YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-10-13 0:31 ` YiFei Zhu @ 2020-10-22 20:52 ` YiFei Zhu 2020-10-22 22:32 ` Kees Cook 0 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-10-22 20:52 UTC (permalink / raw) To: Kees Cook Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Mon, Oct 12, 2020 at 7:31 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > > On Mon, Oct 12, 2020 at 5:57 PM Kees Cook <keescook@chromium.org> wrote: > > I think it's fine to just have this "dangle" with a help text update of > > "if seccomp action caching is supported by the architecture, provide the > > /proc/$pid ..." > > I think it would be weird if someone sees this help text and wonder... > "hmm does my architecture support seccomp action caching" and without > a clear pointer to how seccomp action cache works, goes and compiles > the kernel with this config option on for the purpose of knowing if > their arch supports it... Or, is it a common practice in the kernel to > leave dangling configs? Bump, in case this question was missed. I don't really want to miss the 5.10 merge window... YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-10-22 20:52 ` YiFei Zhu @ 2020-10-22 22:32 ` Kees Cook 2020-10-22 23:40 ` YiFei Zhu 0 siblings, 1 reply; 149+ messages in thread From: Kees Cook @ 2020-10-22 22:32 UTC (permalink / raw) To: YiFei Zhu Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Thu, Oct 22, 2020 at 03:52:20PM -0500, YiFei Zhu wrote: > On Mon, Oct 12, 2020 at 7:31 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > > > > On Mon, Oct 12, 2020 at 5:57 PM Kees Cook <keescook@chromium.org> wrote: > > > I think it's fine to just have this "dangle" with a help text update of > > > "if seccomp action caching is supported by the architecture, provide the > > > /proc/$pid ..." > > > > I think it would be weird if someone sees this help text and wonder... > > "hmm does my architecture support seccomp action caching" and without > > a clear pointer to how seccomp action cache works, goes and compiles > > the kernel with this config option on for the purpose of knowing if > > their arch supports it... Or, is it a common practice in the kernel to > > leave dangling configs? > > Bump, in case this question was missed. I've been going back and forth on this, and I think what I've settled on is I'd like to avoid new CONFIG dependencies just for this feature. Instead, how about we just fill in SECCOMP_NATIVE and SECCOMP_COMPAT for all the HAVE_ARCH_SECCOMP_FILTER architectures, and then the cache reporting can be cleanly tied to CONFIG_SECCOMP_FILTER? It should be relatively simple to extract those details and make SECCOMP_ARCH_{NATIVE,COMPAT}_NAME part of the per-arch enabling patches? > I don't really want to miss the 5.10 merge window... 
Sorry, the 5.10 merge window is already closed for stuff that hasn't already been in -next. Most subsystem maintainers (myself included) don't take new features into their trees between roughly N-rc6 and (N+1)-rc1. My plan is to put this in my -next tree after -rc1 is released (expected to be Oct 25th). I'd still like to get more specific workload performance numbers too. The microbenchmark is nice, but getting things like build times under docker's default seccomp filter, etc would be lovely. I've almost gotten there, but my benchmarks are still really noisy and CPU isolation continues to frustrate me. :) -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-10-22 22:32 ` Kees Cook @ 2020-10-22 23:40 ` YiFei Zhu 2020-10-24 2:51 ` Kees Cook 0 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-10-22 23:40 UTC (permalink / raw) To: Kees Cook Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Thu, Oct 22, 2020 at 5:32 PM Kees Cook <keescook@chromium.org> wrote: > I've been going back and forth on this, and I think what I've settled > on is I'd like to avoid new CONFIG dependencies just for this feature. > Instead, how about we just fill in SECCOMP_NATIVE and SECCOMP_COMPAT > for all the HAVE_ARCH_SECCOMP_FILTER architectures, and then the > cache reporting can be cleanly tied to CONFIG_SECCOMP_FILTER? It > should be relatively simple to extract those details and make > SECCOMP_ARCH_{NATIVE,COMPAT}_NAME part of the per-arch enabling patches? Hmm. So I could enable the cache logic to every architecture (one patch per arch) that does not have the sparse syscall numbers, and then have the proc reporting after the arch patches? I could do that. I don't have test machines to run anything other than x86_64 or ia32, so they will need a closer look by people more familiar with those arches. > I'd still like to get more specific workload performance numbers too. > The microbenchmark is nice, but getting things like build times under > docker's default seccomp filter, etc would be lovely. I've almost gotten > there, but my benchmarks are still really noisy and CPU isolation > continues to frustrate me. :) Ok, let me know if I can help. YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-10-22 23:40 ` YiFei Zhu @ 2020-10-24 2:51 ` Kees Cook 2020-10-30 12:18 ` YiFei Zhu 0 siblings, 1 reply; 149+ messages in thread From: Kees Cook @ 2020-10-24 2:51 UTC (permalink / raw) To: YiFei Zhu Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Thu, Oct 22, 2020 at 06:40:08PM -0500, YiFei Zhu wrote: > On Thu, Oct 22, 2020 at 5:32 PM Kees Cook <keescook@chromium.org> wrote: > > I've been going back and forth on this, and I think what I've settled > > on is I'd like to avoid new CONFIG dependencies just for this feature. > > Instead, how about we just fill in SECCOMP_NATIVE and SECCOMP_COMPAT > > for all the HAVE_ARCH_SECCOMP_FILTER architectures, and then the > > cache reporting can be cleanly tied to CONFIG_SECCOMP_FILTER? It > > should be relatively simple to extract those details and make > > SECCOMP_ARCH_{NATIVE,COMPAT}_NAME part of the per-arch enabling patches? > > Hmm. So I could enable the cache logic to every architecture (one > patch per arch) that does not have the sparse syscall numbers, and > then have the proc reporting after the arch patches? I could do that. > I don't have test machines to run anything other than x86_64 or ia32, > so they will need a closer look by people more familiar with those > arches. Cool, yes please. It looks like MIPS will need to be skipped for now. I would have the debug cache reporting patch then depend on !CONFIG_HAVE_SPARSE_SYSCALL_NR. > > I'd still like to get more specific workload performance numbers too. > > The microbenchmark is nice, but getting things like build times under > > docker's default seccomp filter, etc would be lovely. 
I've almost gotten > > there, but my benchmarks are still really noisy and CPU isolation > > continues to frustrate me. :) > > Ok, let me know if I can help. Do you have a test environment where you can compare the before/after of repeated kernel build times (or some other sufficiently complex/interesting) workload under these conditions: bare metal docker w/ seccomp policy disabled docker w/ default seccomp policy This is what I've been trying to construct, but it's really noisy, so I've been trying to pin CPUs and NUMA memory nodes, but it's not really helping yet. :P -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-10-24 2:51 ` Kees Cook @ 2020-10-30 12:18 ` YiFei Zhu 2020-11-03 13:00 ` YiFei Zhu 0 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-10-30 12:18 UTC (permalink / raw) To: Kees Cook Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Fri, Oct 23, 2020 at 9:51 PM Kees Cook <keescook@chromium.org> wrote:

> Do you have a test environment where you can compare the before/after
> of repeated kernel build times (or some other sufficiently
> complex/interesting) workload under these conditions:
>
> bare metal
> docker w/ seccomp policy disabled
> docker w/ default seccomp policy
>
> This is what I've been trying to construct, but it's really noisy, so
> I've been trying to pin CPUs and NUMA memory nodes, but it's not really
> helping yet. :P

Hi, sorry for the delay. The benchmarks took a while to collect.

I got a bare metal test machine with an Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz, running Ubuntu 18.04. Test kernels are compiled at 57a339117e52 ("selftests/seccomp: Compare bitmap vs filter overhead") and 3650b228f83a ("Linux 5.10-rc1"), built with Ubuntu's 5.3.0-64-generic config, then `make olddefconfig`. "Mitigations off" indicates the kernel was booted with "nospectre_v2 nospectre_v1 no_stf_barrier tsx=off tsx_async_abort=off".

The benchmark was a single-job make of the x86_64 defconfig of 5.9.1, with CPU affinity set to processor #0 only. Raw results are appended below. Each boot is tested by running the build directly and inside docker, with and without seccomp. The commands used are attached below. Each test is 4 trials, with the middle two (non-minimum, non-maximum) wall clock times averaged.
Results summary:

                Mitigations On                Mitigations Off
            With Cache  Without Cache    With Cache  Without Cache
Native        18:17.38       18:13.78      18:16.08       18:15.67
D. no seccomp 18:15.54       18:17.71      18:17.58       18:16.75
D. + seccomp  20:42.47       20:45.04      18:47.67       18:49.01

To be honest, I'm somewhat surprised that it didn't produce as much of a dent in the seccomp overhead in this macro benchmark as I had expected.

Below are the commands used and the outputs from the time command.

Commands used to start the docker containers:

docker run -w /srv/yifeifz2/linux-buildtest \
    --tmpfs /srv/yifeifz2/linux-buildtest:exec --rm -it ubuntu:18.04
docker run -w /srv/yifeifz2/linux-buildtest \
    --tmpfs /srv/yifeifz2/linux-buildtest:exec --rm -it \
    --security-opt seccomp=unconfined ubuntu:18.04

Commands used to install the toolchain inside docker:

apt -y update
apt -y dist-upgrade
apt -y install build-essential wget flex bison time libssl-dev bc libelf-dev

Commands to benchmark on native:

for i in {1..4}; do
    mkdir -p /srv/yifeifz2/linux-buildtest
    mount -t tmpfs tmpfs /srv/yifeifz2/linux-buildtest
    pushd /srv/yifeifz2/linux-buildtest
    wget https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.9.1.tar.xz
    tar xf linux-5.9.1.tar.xz
    cd linux-5.9.1
    make mrproper
    make defconfig
    taskset 0x1 time make -j1 > /dev/null
    popd
    umount /srv/yifeifz2/linux-buildtest
done

Commands to benchmark inside docker:

for i in {1..4}; do
    wget https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.9.1.tar.xz
    tar xf linux-5.9.1.tar.xz
    pushd linux-5.9.1
    make mrproper
    make defconfig
    taskset 0x1 time make -j1 > /dev/null
    popd
    rm -rf linux-5.9.1 linux-5.9.1.tar.xz
done

==== with cache, mitigations on ====
973.52user 113.98system 18:16.51elapsed 99%CPU (0avgtext+0avgdata 239784maxresident)k 0inputs+217152outputs (0major+51937662minor)pagefaults 0swaps
973.74user 115.35system 18:18.41elapsed 99%CPU (0avgtext+0avgdata 239640maxresident)k 0inputs+217152outputs (0major+51933865minor)pagefaults 0swaps
973.31user 114.41system 18:17.37elapsed 99%CPU (0avgtext+0avgdata
239660maxresident)k 72inputs+217152outputs (0major+51936343minor)pagefaults 0swaps 971.76user 116.04system 18:17.39elapsed 99%CPU (0avgtext+0avgdata 239588maxresident)k 0inputs+217152outputs (0major+51936222minor)pagefaults 0swaps 961.44user 121.30system 18:15.30elapsed 98%CPU (0avgtext+0avgdata 239580maxresident)k 0inputs+217152outputs (0major+51555371minor)pagefaults 0swaps 961.86user 119.48system 18:13.96elapsed 98%CPU (0avgtext+0avgdata 239480maxresident)k 0inputs+217152outputs (0major+51552153minor)pagefaults 0swaps 961.68user 121.75system 18:15.78elapsed 98%CPU (0avgtext+0avgdata 239504maxresident)k 0inputs+217152outputs (0major+51559201minor)pagefaults 0swaps 960.80user 122.04system 18:18.99elapsed 98%CPU (0avgtext+0avgdata 239644maxresident)k 0inputs+217152outputs (0major+51557386minor)pagefaults 0swaps 1104.08user 124.48system 20:42.13elapsed 98%CPU (0avgtext+0avgdata 239544maxresident)k 984inputs+217152outputs (21major+51552022minor)pagefaults 0swaps 1101.78user 125.66system 20:40.80elapsed 98%CPU (0avgtext+0avgdata 239692maxresident)k 0inputs+217152outputs (0major+51546446minor)pagefaults 0swaps 1102.98user 126.03system 20:43.09elapsed 98%CPU (0avgtext+0avgdata 239592maxresident)k 0inputs+217152outputs (0major+51551238minor)pagefaults 0swaps 1103.34user 125.32system 20:42.82elapsed 98%CPU (0avgtext+0avgdata 239620maxresident)k 0inputs+217152outputs (0major+51554493minor)pagefaults 0swaps ==== without cache, mitigations on ==== 967.19user 115.77system 18:17.20elapsed 98%CPU (0avgtext+0avgdata 239536maxresident)k 25112inputs+217152outputs (166major+51935958minor)pagefaults 0swaps 969.05user 114.18system 18:12.92elapsed 99%CPU (0avgtext+0avgdata 239544maxresident)k 0inputs+217152outputs (0major+51938961minor)pagefaults 0swaps 968.51user 116.50system 18:14.64elapsed 99%CPU (0avgtext+0avgdata 239716maxresident)k 0inputs+217152outputs (0major+51937686minor)pagefaults 0swaps 968.53user 115.13system 18:10.33elapsed 99%CPU (0avgtext+0avgdata 239628maxresident)k 
0inputs+217152outputs (0major+51938033minor)pagefaults 0swaps 962.85user 121.56system 18:17.73elapsed 98%CPU (0avgtext+0avgdata 239736maxresident)k 0inputs+217152outputs (0major+51549715minor)pagefaults 0swaps 962.51user 121.74system 18:17.42elapsed 98%CPU (0avgtext+0avgdata 239480maxresident)k 0inputs+217152outputs (0major+51558249minor)pagefaults 0swaps 963.37user 121.24system 18:18.59elapsed 98%CPU (0avgtext+0avgdata 239224maxresident)k 0inputs+217152outputs (0major+51551031minor)pagefaults 0swaps 963.71user 120.75system 18:17.70elapsed 98%CPU (0avgtext+0avgdata 239460maxresident)k 0inputs+217152outputs (0major+51555583minor)pagefaults 0swaps 1103.35user 126.49system 20:45.59elapsed 98%CPU (0avgtext+0avgdata 239600maxresident)k 984inputs+217152outputs (21major+51557916minor)pagefaults 0swaps 1103.01user 126.69system 20:45.36elapsed 98%CPU (0avgtext+0avgdata 239708maxresident)k 232inputs+217152outputs (0major+51560311minor)pagefaults 0swaps 1102.97user 127.13system 20:44.73elapsed 98%CPU (0avgtext+0avgdata 239440maxresident)k 0inputs+217152outputs (0major+51552998minor)pagefaults 0swaps 1103.09user 127.01system 20:44.48elapsed 98%CPU (0avgtext+0avgdata 239448maxresident)k 0inputs+217152outputs (0major+51559328minor)pagefaults 0swaps ==== with cache, mitigations off ==== 971.35user 114.45system 18:16.36elapsed 99%CPU (0avgtext+0avgdata 239740maxresident)k 1584inputs+217152outputs (10major+51937572minor)pagefaults 0swaps 971.75user 115.18system 18:16.04elapsed 99%CPU (0avgtext+0avgdata 239648maxresident)k 0inputs+217152outputs (0major+51944016minor)pagefaults 0swaps 972.03user 114.47system 18:16.12elapsed 99%CPU (0avgtext+0avgdata 239368maxresident)k 744inputs+217152outputs (0major+51946745minor)pagefaults 0swaps 970.59user 115.13system 18:15.21elapsed 99%CPU (0avgtext+0avgdata 239736maxresident)k 0inputs+217152outputs (1major+51936971minor)pagefaults 0swaps 964.13user 121.15system 18:17.44elapsed 98%CPU (0avgtext+0avgdata 239496maxresident)k 0inputs+217152outputs 
(0major+51554855minor)pagefaults 0swaps 964.46user 120.73system 18:16.89elapsed 98%CPU (0avgtext+0avgdata 239492maxresident)k 0inputs+217152outputs (0major+51563668minor)pagefaults 0swaps 964.00user 121.71system 18:18.42elapsed 98%CPU (0avgtext+0avgdata 239504maxresident)k 0inputs+217152outputs (0major+51549101minor)pagefaults 0swaps 963.99user 121.46system 18:17.72elapsed 98%CPU (0avgtext+0avgdata 239644maxresident)k 0inputs+217152outputs (0major+51561705minor)pagefaults 0swaps 993.01user 123.83system 18:47.73elapsed 99%CPU (0avgtext+0avgdata 239648maxresident)k 984inputs+217152outputs (21major+51554203minor)pagefaults 0swaps 991.53user 125.49system 18:47.28elapsed 99%CPU (0avgtext+0avgdata 239380maxresident)k 0inputs+217152outputs (0major+51557014minor)pagefaults 0swaps 992.52user 124.53system 18:47.61elapsed 99%CPU (0avgtext+0avgdata 239344maxresident)k 0inputs+217152outputs (0major+51555681minor)pagefaults 0swaps 993.47user 125.01system 18:48.98elapsed 99%CPU (0avgtext+0avgdata 239448maxresident)k 0inputs+217152outputs (0major+51558830minor)pagefaults 0swaps ==== without cache, mitigations off ==== 969.87user 118.18system 18:16.82elapsed 99%CPU (0avgtext+0avgdata 239640maxresident)k 0inputs+217152outputs (0major+51937042minor)pagefaults 0swaps 971.42user 114.62system 18:14.93elapsed 99%CPU (0avgtext+0avgdata 239840maxresident)k 0inputs+217152outputs (0major+51937617minor)pagefaults 0swaps 971.73user 114.40system 18:15.39elapsed 99%CPU (0avgtext+0avgdata 239724maxresident)k 0inputs+217152outputs (0major+51937768minor)pagefaults 0swaps 969.71user 117.13system 18:15.95elapsed 99%CPU (0avgtext+0avgdata 239680maxresident)k 0inputs+217152outputs (0major+51940505minor)pagefaults 0swaps 963.51user 121.32system 18:16.91elapsed 98%CPU (0avgtext+0avgdata 239516maxresident)k 0inputs+217152outputs (0major+51561337minor)pagefaults 0swaps 963.10user 120.75system 18:17.34elapsed 98%CPU (0avgtext+0avgdata 239464maxresident)k 0inputs+217152outputs 
(0major+51547338minor)pagefaults 0swaps 962.27user 122.48system 18:16.59elapsed 98%CPU (0avgtext+0avgdata 239544maxresident)k 0inputs+217152outputs (0major+51552060minor)pagefaults 0swaps 962.83user 120.21system 18:15.37elapsed 98%CPU (0avgtext+0avgdata 239496maxresident)k 0inputs+217152outputs (0major+51553345minor)pagefaults 0swaps 990.69user 125.78system 18:48.93elapsed 98%CPU (0avgtext+0avgdata 239440maxresident)k 984inputs+217152outputs (21major+51558142minor)pagefaults 0swaps 990.76user 126.01system 18:48.88elapsed 98%CPU (0avgtext+0avgdata 239800maxresident)k 0inputs+217152outputs (0major+51558483minor)pagefaults 0swaps 991.06user 125.99system 18:49.30elapsed 98%CPU (0avgtext+0avgdata 239412maxresident)k 0inputs+217152outputs (0major+51556462minor)pagefaults 0swaps 992.33user 124.77system 18:49.09elapsed 98%CPU (0avgtext+0avgdata 239684maxresident)k 0inputs+217152outputs (0major+51549745minor)pagefaults 0swaps YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-10-30 12:18 ` YiFei Zhu @ 2020-11-03 13:00 ` YiFei Zhu 2020-11-04 0:29 ` Kees Cook 0 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-11-03 13:00 UTC (permalink / raw) To: Kees Cook Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Fri, Oct 30, 2020 at 7:18 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > I got a bare metal test machine with Intel(R) Xeon(R) CPU E5-2660 v3 @ > 2.60GHz, running Ubuntu 18.04. Test kernels are compiled at > 57a339117e52 ("selftests/seccomp: Compare bitmap vs filter overhead") > and 3650b228f83a ("Linux 5.10-rc1"), built with Ubuntu's > 5.3.0-64-generic's config, then `make olddefconfig`. "Mitigations off" > indicate the kernel was booted with "nospectre_v2 nospectre_v1 > no_stf_barrier tsx=off tsx_async_abort=off". > > The benchmark was single-job make on x86_64 defconfig of 5.9.1, with > CPU affinity to set only processor #0. Raw results are appended below. > Each boot is tested by running the build directly and inside docker, > with and without seccomp. The commands used are attached below. Each > test is 4 trials, with the middle two (non-minimum, non-maximum) wall > clock time averaged. Results summary: > > Mitigations On Mitigations Off > With Cache Without Cache With Cache Without Cache > Native 18:17.38 18:13.78 18:16.08 18:15.67 > D. no seccomp 18:15.54 18:17.71 18:17.58 18:16.75 > D. + seccomp 20:42.47 20:45.04 18:47.67 18:49.01 > > To be honest, I'm somewhat surprised that it didn't produce as much of > a dent in the seccomp overhead in this macro benchmark as I had > expected. 
My peers pointed out that in my previous benchmark there are still a few mitigations left on, and suggested using "noibrs noibpb nopti nospectre_v2 nospectre_v1 l1tf=off nospec_store_bypass_disable no_stf_barrier mds=off tsx=on tsx_async_abort=off mitigations=off". Results with "Mitigations Off" updated:

                Mitigations On                Mitigations Off
            With Cache  Without Cache    With Cache  Without Cache
Native        18:17.38       18:13.78      17:43.42       17:47.68
D. no seccomp 18:15.54       18:17.71      17:34.59       17:37.54
D. + seccomp  20:42.47       20:45.04      17:35.70       17:37.16

Whether seccomp is on or off seems not to make much of a difference for this benchmark. Enabling the bitmap does seem to decrease the overall compilation time, but it also does so in the runs where seccomp is off, so the speedup is probably from other factors. We are thinking about using more syscall-intensive workloads, such as httpd.

Though, this does make me wonder: where does the 3-minute overhead with seccomp with mitigations come from? Is it data cache misses? If that is the case, can we somehow preload the seccomp bitmap cache maybe? I mean, mitigations only cause around half a minute of slowdown without seccomp, but seccomp somehow amplifies the slowdown by an additional 2.5 minutes, so something must be off here.
This is the raw output for the time commands: ==== with cache, mitigations off ==== 947.02user 108.62system 17:47.65elapsed 98%CPU (0avgtext+0avgdata 239804maxresident)k 25112inputs+217152outputs (166major+51934447minor)pagefaults 0swaps 947.91user 108.20system 17:46.53elapsed 99%CPU (0avgtext+0avgdata 239576maxresident)k 0inputs+217152outputs (0major+51941524minor)pagefaults 0swaps 948.33user 108.70system 17:47.72elapsed 98%CPU (0avgtext+0avgdata 239604maxresident)k 0inputs+217152outputs (0major+51938566minor)pagefaults 0swaps 948.65user 108.81system 17:48.41elapsed 98%CPU (0avgtext+0avgdata 239692maxresident)k 0inputs+217152outputs (0major+51935349minor)pagefaults 0swaps 932.12user 113.68system 17:37.24elapsed 98%CPU (0avgtext+0avgdata 239660maxresident)k 0inputs+217152outputs (0major+51547571minor)pagefaults 0swap 931.69user 114.12system 17:37.84elapsed 98%CPU (0avgtext+0avgdata 239448maxresident)k 0inputs+217152outputs (0major+51539964minor)pagefaults 0swaps 932.25user 113.39system 17:37.75elapsed 98%CPU (0avgtext+0avgdata 239372maxresident)k 0inputs+217152outputs (0major+51538018minor)pagefaults 0swaps 931.09user 114.25system 17:37.34elapsed 98%CPU (0avgtext+0avgdata 239508maxresident)k 0inputs+217152outputs (0major+51537700minor)pagefaults 0swaps 929.96user 113.42system 17:36.23elapsed 98%CPU (0avgtext+0avgdata 239448maxresident)k 984inputs+217152outputs (22major+51544059minor)pagefaults 0swaps 929.73user 115.13system 17:38.09elapsed 98%CPU (0avgtext+0avgdata 239464maxresident)k 0inputs+217152outputs (0major+51540259minor)pagefaults 0swaps 930.13user 112.71system 17:36.17elapsed 98%CPU (0avgtext+0avgdata 239620maxresident)k 0inputs+217152outputs (0major+51540623minor)pagefaults 0swaps 930.57user 113.02system 17:49.70elapsed 97%CPU (0avgtext+0avgdata 239432maxresident)k 0inputs+217152outputs (0major+51537776minor)pagefaults 0swaps ==== without cache, mitigations off ==== 947.59user 108.06system 17:44.56elapsed 99%CPU (0avgtext+0avgdata 239484maxresident)k 
25112inputs+217152outputs (167major+51938723minor)pagefaults 0swaps 947.95user 108.58system 17:43.40elapsed 99%CPU (0avgtext+0avgdata 239580maxresident)k 0inputs+217152outputs (0major+51943434minor)pagefaults 0swaps 948.54user 106.62system 17:42.39elapsed 99%CPU (0avgtext+0avgdata 239608maxresident)k 0inputs+217152outputs (0major+51936408minor)pagefaults 0swaps 947.85user 107.92system 17:43.44elapsed 99%CPU (0avgtext+0avgdata 239656maxresident)k 0inputs+217152outputs (0major+51931633minor)pagefaults 0swaps 931.28user 111.16system 17:33.59elapsed 98%CPU (0avgtext+0avgdata 239440maxresident)k 0inputs+217152outputs (0major+51543540minor)pagefaults 0swaps 930.21user 112.56system 17:34.20elapsed 98%CPU (0avgtext+0avgdata 239400maxresident)k 0inputs+217152outputs (0major+51539699minor)pagefaults 0swaps 930.16user 113.74system 17:35.06elapsed 98%CPU (0avgtext+0avgdata 239344maxresident)k 0inputs+217152outputs (0major+51543072minor)pagefaults 0swaps 930.17user 112.77system 17:34.98elapsed 98%CPU (0avgtext+0avgdata 239176maxresident)k 0inputs+217152outputs (0major+51540777minor)pagefaults 0swaps 931.92user 113.31system 17:36.05elapsed 98%CPU (0avgtext+0avgdata 239520maxresident)k 984inputs+217152outputs (22major+51534636minor)pagefaults 0swaps 931.14user 112.81system 17:35.35elapsed 98%CPU (0avgtext+0avgdata 239524maxresident)k 0inputs+217152outputs (0major+51549007minor)pagefaults 0swaps 930.93user 114.56system 17:37.72elapsed 98%CPU (0avgtext+0avgdata 239360maxresident)k 0inputs+217152outputs (0major+51542191minor)pagefaults 0swaps 932.26user 111.54system 17:35.36elapsed 98%CPU (0avgtext+0avgdata 239572maxresident)k 0inputs+217152outputs (0major+51537921minor)pagefaults 0swaps YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-11-03 13:00 ` YiFei Zhu @ 2020-11-04 0:29 ` Kees Cook 2020-11-04 11:40 ` YiFei Zhu 0 siblings, 1 reply; 149+ messages in thread From: Kees Cook @ 2020-11-04 0:29 UTC (permalink / raw) To: YiFei Zhu Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Tue, Nov 03, 2020 at 07:00:22AM -0600, YiFei Zhu wrote: > My peers pointed out that in my previous benchmark there are still a > few mitigations left on, and suggested to use "noibrs noibpb nopti > nospectre_v2 nospectre_v1 l1tf=off nospec_store_bypass_disable > no_stf_barrier mds=off tsx=on tsx_async_abort=off mitigations=off". > Results with "Mitigations Off" updated: > > Mitigations On Mitigations Off > With Cache Without Cache With Cache Without Cache > Native 18:17.38 18:13.78 17:43.42 17:47.68 > D. no seccomp 18:15.54 18:17.71 17:34.59 17:37.54 > D. + seccomp 20:42.47 20:45.04 17:35.70 17:37.16 > > Whether seccomp is on or off seems not to make much of a difference > for this benchmark. Bitmap being enabled does seem to decrease the > overall compilation time but it also affects where seccomp is off, so > the speedup is probably from other factors. We are thinking about > using more syscall-intensive workloads, such as httpd. Yeah, this is very interesting. That there is anything measurably _slower_ with the cache is surprising. Though with only 4 runs, I wonder if it's still noisy? What happens at 10 runs -- more importantly what is the standard deviation? > Thugh, this does make me wonder, where does the 3-minute overhead with > seccomp with mitigations come from? Is it data cache misses? If that > is the case, can we somehow preload the seccomp bitmap cache maybe? 
I > mean, mitigations only cause around half a minute slowdown without > seccomp but seccomp somehow amplify the slowdown with an additional > 2.5 minutes, so something must be off here. I assume this is from Indirect Branch Prediction Barrier (IBPB) and Single Threaded Indirect Branch Prediction (STIBP) (which get enabled for threads under seccomp by default). Try booting with "spectre_v2_user=prctl" https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/spectre.html#spectre-mitigation-control-command-line -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-11-04 0:29 ` Kees Cook @ 2020-11-04 11:40 ` YiFei Zhu 2020-11-04 18:57 ` Kees Cook 0 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-11-04 11:40 UTC (permalink / raw) To: Kees Cook Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Tue, Nov 3, 2020 at 6:29 PM Kees Cook <keescook@chromium.org> wrote: > Yeah, this is very interesting. That there is anything measurably _slower_ > with the cache is surprising. Though with only 4 runs, I wonder if it's > still noisy? What happens at 10 runs -- more importantly what is the > standard deviation? I could do that. it just takes such a long time. Each run takes about 20 minutes so with 10 runs per environment, 3 environments (native + 2 docker) per boot, and 4 boots (2 bootparam * 2 compile config), it's 27 hours of compilation. I should probably script it at that point. > I assume this is from Indirect Branch Prediction Barrier (IBPB) and > Single Threaded Indirect Branch Prediction (STIBP) (which get enabled > for threads under seccomp by default). > > Try booting with "spectre_v2_user=prctl" Hmm, to make sure, boot with just "spectre_v2_user=prctl" on the command line and test the performance of that? YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-11-04 11:40 ` YiFei Zhu @ 2020-11-04 18:57 ` Kees Cook 0 siblings, 0 replies; 149+ messages in thread From: Kees Cook @ 2020-11-04 18:57 UTC (permalink / raw) To: YiFei Zhu Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Wed, Nov 04, 2020 at 05:40:51AM -0600, YiFei Zhu wrote: > On Tue, Nov 3, 2020 at 6:29 PM Kees Cook <keescook@chromium.org> wrote: > > Yeah, this is very interesting. That there is anything measurably _slower_ > > with the cache is surprising. Though with only 4 runs, I wonder if it's > > still noisy? What happens at 10 runs -- more importantly what is the > > standard deviation? > > I could do that. it just takes such a long time. Each run takes about > 20 minutes so with 10 runs per environment, 3 environments (native + 2 > docker) per boot, and 4 boots (2 bootparam * 2 compile config), it's > 27 hours of compilation. I should probably script it at that point. Yeah, I was facing the same issues. Though perhaps hackbench (with multiple CPUs) would be a better test (and it's much faster): https://lore.kernel.org/lkml/7723ae8d-8333-ba17-6983-a45ec8b11c54@redhat.com/ (I usually run this with a CNT of 20 to get quick results.) > > I assume this is from Indirect Branch Prediction Barrier (IBPB) and > > Single Threaded Indirect Branch Prediction (STIBP) (which get enabled > > for threads under seccomp by default). > > > > Try booting with "spectre_v2_user=prctl" > > Hmm, to make sure, boot with just "spectre_v2_user=prctl" on the > command line and test the performance of that? Right, see if that eliminates the 3 minute jump seen for seccomp. -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
* [PATCH v5 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results 2020-10-09 17:14 ` [PATCH v4 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu ` (4 preceding siblings ...) 2020-10-09 17:14 ` [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu @ 2020-10-11 15:47 ` YiFei Zhu 2020-10-11 15:47 ` [PATCH v5 seccomp 1/5] seccomp/cache: Lookup syscall allowlist bitmap for fast path YiFei Zhu ` (5 more replies) 5 siblings, 6 replies; 149+ messages in thread From: YiFei Zhu @ 2020-10-11 15:47 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: YiFei Zhu <yifeifz2@illinois.edu>

Alternative: https://lore.kernel.org/lkml/20200923232923.3142503-1-keescook@chromium.org/T/

Major differences from the linked alternative by Kees:
* No x32 special-case handling -- not worth the complexity
* No caching of denylist -- not worth the complexity
* No seccomp arch pinning -- I think this is an independent feature
* The bitmaps are part of the filters rather than the task.

This series adds a bitmap to cache seccomp filter results if the result permits a syscall and is independent of syscall arguments. This visibly decreases seccomp overhead for most common seccomp filters with very little memory footprint.

The overhead of running Seccomp filters has been part of some past discussions [1][2][3]. Oftentimes, the filters have a large number of instructions that check syscall numbers one by one and jump based on that. Some users chain BPF filters, which further enlarges the overhead.
A recent work [6] comprehensively measures the Seccomp overhead and shows that the overhead is non-negligible and has a non-trivial impact on application performance.

We observed that some common filters, such as docker's [4] or systemd's [5], make most decisions based only on the syscall numbers, and as past discussions considered, a bitmap where each bit represents a syscall makes the most sense for these filters.

In order to build this bitmap at filter attach time, each filter is emulated for every syscall (under each possible architecture), and checked for any accesses of struct seccomp_data that are neither the "arch" nor the "nr" (syscall) member. If only "arch" and "nr" are examined, and the program returns allow, then we can be sure that the filter must return allow independent of syscall arguments. When it is concluded that an allow must occur for the given architecture and syscall pair, seccomp will immediately allow the syscall, bypassing further BPF execution.

Ongoing work is to further support arguments with fast hash table lookups. We are investigating the performance of doing so [6], and how to best integrate with the existing seccomp infrastructure.

Some benchmarks are performed with results in patch 5, copied below:

Current BPF sysctl settings:
net.core.bpf_jit_enable = 1
net.core.bpf_jit_harden = 0
Benchmarking 200000000 syscalls...
129.359381409 - 0.008724424 = 129350656985 (129.4s)
getpid native: 646 ns
264.385890006 - 129.360453229 = 135025436777 (135.0s)
getpid RET_ALLOW 1 filter (bitmap): 675 ns
399.400511893 - 264.387045901 = 135013465992 (135.0s)
getpid RET_ALLOW 2 filters (bitmap): 675 ns
545.872866260 - 399.401718327 = 146471147933 (146.5s)
getpid RET_ALLOW 3 filters (full): 732 ns
696.337101319 - 545.874097681 = 150463003638 (150.5s)
getpid RET_ALLOW 4 filters (full): 752 ns
Estimated total seccomp overhead for 1 bitmapped filter: 29 ns
Estimated total seccomp overhead for 2 bitmapped filters: 29 ns
Estimated total seccomp overhead for 3 full filters: 86 ns
Estimated total seccomp overhead for 4 full filters: 106 ns
Estimated seccomp entry overhead: 29 ns
Estimated seccomp per-filter overhead (last 2 diff): 20 ns
Estimated seccomp per-filter overhead (filters / 4): 19 ns
Expectations:
native ≤ 1 bitmap (646 ≤ 675): ✔️
native ≤ 1 filter (646 ≤ 732): ✔️
per-filter (last 2 diff) ≈ per-filter (filters / 4) (20 ≈ 19): ✔️
1 bitmapped ≈ 2 bitmapped (29 ≈ 29): ✔️
entry ≈ 1 bitmapped (29 ≈ 29): ✔️
entry ≈ 2 bitmapped (29 ≈ 29): ✔️
native + entry + (per filter * 4) ≈ 4 filters total (755 ≈ 752): ✔️

v4 -> v5:
* Typo and wording fixes
* Skip arch number test when there is only one arch
* Fixed prog instruction number check.
* Added comment about the behavior of x32.
* /proc/pid/seccomp_cache returns -ESRCH for an exiting process.
* Fixed /proc/pid/seccomp_cache depending on the architecture.
* Fixed struct seq_file visibility reported by kernel test robot.

v3 -> v4:
* Reordered patches
* Naming changes
* Fixed racing in /proc/pid/seccomp_cache against filter being released from task, using Jann's suggestion of sighand spinlock.
* Cache no longer configurable.
* Copied some description from cover letter to commit messages.
* Used Kees's logic to clear bits from the bitmap, rather than set bits.

v2 -> v3:
* Added array_index_nospec guards
* No more syscall_arches[] array and expecting on loop unrolling.
  Arches are configured with per-arch seccomp.h.
* Moved filter emulation to attach time (from prepare time).
* Further simplified emulator, based on Kees's code.
* Guard /proc/pid/seccomp_cache with CAP_SYS_ADMIN.

v1 -> v2:
* Corrected one outdated function documentation.

RFC -> v1:
* Config made on by default across all arches that could support it.
* Added an arch numbers array, emulate the filter for each arch number, and have a per-arch bitmap.
* Massively simplified the emulator so it would only support the common instructions in Kees's list.
* Fixed inheriting bitmap across filters (filter->prev is always NULL during prepare).
* Stole the selftest from Kees.
* Added /proc/pid/seccomp_cache by Jann's suggestion.

Patch 1 implements the test_bit against the bitmaps.
Patch 2 implements the emulator that finds if a filter must return allow.
Patch 3 adds the arch macros for x86.
Patch 4 updates the selftest to better show the new semantics.
Patch 5 implements /proc/pid/seccomp_cache.

[1] https://lore.kernel.org/linux-security-module/c22a6c3cefc2412cad00ae14c1371711@huawei.com/T/
[2] https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/T/
[3] https://github.com/seccomp/libseccomp/issues/116
[4] https://github.com/moby/moby/blob/ae0ef82b90356ac613f329a8ef5ee42ca923417d/profiles/seccomp/default.json
[5] https://github.com/systemd/systemd/blob/6743a1caf4037f03dc51a1277855018e4ab61957/src/shared/seccomp-util.c#L270
[6] Draco: Architectural and Operating System Support for System Call Security https://tianyin.github.io/pub/draco.pdf, MICRO-53, Oct.
2020

Kees Cook (2):
  x86: Enable seccomp architecture tracking
  selftests/seccomp: Compare bitmap vs filter overhead

YiFei Zhu (3):
  seccomp/cache: Lookup syscall allowlist bitmap for fast path
  seccomp/cache: Add "emulator" to check if filter is constant allow
  seccomp/cache: Report cache data through /proc/pid/seccomp_cache

 arch/Kconfig                                  |  24 ++
 arch/x86/Kconfig                              |   1 +
 arch/x86/include/asm/seccomp.h                |  20 ++
 fs/proc/base.c                                |   6 +
 include/linux/seccomp.h                       |   7 +
 kernel/seccomp.c                              | 292 +++++++++++++++++-
 .../selftests/seccomp/seccomp_benchmark.c     | 151 +++++++--
 tools/testing/selftests/seccomp/settings      |   2 +-
 8 files changed, 479 insertions(+), 24 deletions(-)

--
2.28.0

^ permalink raw reply	[flat|nested] 149+ messages in thread
* [PATCH v5 seccomp 1/5] seccomp/cache: Lookup syscall allowlist bitmap for fast path
  2020-10-11 15:47 ` [PATCH v5 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu
@ 2020-10-11 15:47 ` YiFei Zhu
  2020-10-12  6:42   ` Jann Horn
  2020-10-11 15:47 ` [PATCH v5 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow YiFei Zhu
  ` (4 subsequent siblings)
  5 siblings, 1 reply; 149+ messages in thread
From: YiFei Zhu @ 2020-10-11 15:47 UTC (permalink / raw)
To: containers
Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli,
	Andy Lutomirski, David Laight, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn,
	Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

From: YiFei Zhu <yifeifz2@illinois.edu>

The overhead of running Seccomp filters has been part of some past
discussions [1][2][3]. Oftentimes, the filters have a large number
of instructions that check syscall numbers one by one and jump based
on that. Some users chain BPF filters which further enlarge the
overhead. A recent work [6] comprehensively measures the Seccomp
overhead and shows that the overhead is non-negligible and has a
non-trivial impact on application performance.

We observed that some common filters, such as docker's [4] or
systemd's [5], make most decisions based only on the syscall
numbers, and as past discussions considered, a bitmap where each bit
represents a syscall makes most sense for these filters.

The fast (common) path for seccomp should be that the filter permits
the syscall to pass through, and failing seccomp is expected to be
an exceptional case; it is not expected for userspace to call a
denylisted syscall over and over.

When it can be concluded that an allow must occur for the given
architecture and syscall pair (this determination is introduced in
the next commit), seccomp will immediately allow the syscall,
bypassing further BPF execution.
Each architecture number has its own bitmap. The architecture
number in seccomp_data is checked against the defined architecture
number constant before proceeding to test the bit against the
bitmap with the syscall number as the index of the bit in the
bitmap, and if the bit is set, seccomp returns allow. The bitmaps
are all clear in this patch and will be initialized in the next
commit.

When only one architecture exists, the check against the architecture
number is skipped, as suggested by Kees Cook [7].

[1] https://lore.kernel.org/linux-security-module/c22a6c3cefc2412cad00ae14c1371711@huawei.com/T/
[2] https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/T/
[3] https://github.com/seccomp/libseccomp/issues/116
[4] https://github.com/moby/moby/blob/ae0ef82b90356ac613f329a8ef5ee42ca923417d/profiles/seccomp/default.json
[5] https://github.com/systemd/systemd/blob/6743a1caf4037f03dc51a1277855018e4ab61957/src/shared/seccomp-util.c#L270
[6] Draco: Architectural and Operating System Support for System Call Security
    https://tianyin.github.io/pub/draco.pdf, MICRO-53, Oct. 2020
[7] https://lore.kernel.org/bpf/202010091614.8BB0EB64@keescook/

Co-developed-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu>
Signed-off-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu>
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
---
 kernel/seccomp.c | 77 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 77 insertions(+)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index ae6b40cc39f4..d67a8b61f2bf 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -143,6 +143,34 @@ struct notification {
 	struct list_head notifications;
 };

+#ifdef SECCOMP_ARCH_NATIVE
+/**
+ * struct action_cache - per-filter cache of seccomp actions per
+ * arch/syscall pair
+ *
+ * @allow_native: A bitmap where each bit represents whether the
+ *		  filter will always allow the syscall, for the
+ *		  native architecture.
+ * @allow_compat: A bitmap where each bit represents whether the
+ *		  filter will always allow the syscall, for the
+ *		  compat architecture.
+ */
+struct action_cache {
+	DECLARE_BITMAP(allow_native, SECCOMP_ARCH_NATIVE_NR);
+#ifdef SECCOMP_ARCH_COMPAT
+	DECLARE_BITMAP(allow_compat, SECCOMP_ARCH_COMPAT_NR);
+#endif
+};
+#else
+struct action_cache { };
+
+static inline bool seccomp_cache_check_allow(const struct seccomp_filter *sfilter,
+					     const struct seccomp_data *sd)
+{
+	return false;
+}
+#endif /* SECCOMP_ARCH_NATIVE */
+
 /**
  * struct seccomp_filter - container for seccomp BPF programs
  *
@@ -298,6 +326,52 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen)
 	return 0;
 }

+#ifdef SECCOMP_ARCH_NATIVE
+static inline bool seccomp_cache_check_allow_bitmap(const void *bitmap,
+						    size_t bitmap_size,
+						    int syscall_nr)
+{
+	if (unlikely(syscall_nr < 0 || syscall_nr >= bitmap_size))
+		return false;
+	syscall_nr = array_index_nospec(syscall_nr, bitmap_size);
+
+	return test_bit(syscall_nr, bitmap);
+}
+
+/**
+ * seccomp_cache_check_allow - lookup seccomp cache
+ * @sfilter: The seccomp filter
+ * @sd: The seccomp data to lookup the cache with
+ *
+ * Returns true if the seccomp_data is cached and allowed.
+ */
+static inline bool seccomp_cache_check_allow(const struct seccomp_filter *sfilter,
+					     const struct seccomp_data *sd)
+{
+	int syscall_nr = sd->nr;
+	const struct action_cache *cache = &sfilter->cache;
+
+#ifndef SECCOMP_ARCH_COMPAT
+	/* A native-only architecture doesn't need to check sd->arch. */
+	return seccomp_cache_check_allow_bitmap(cache->allow_native,
+						SECCOMP_ARCH_NATIVE_NR,
+						syscall_nr);
+#else
+	if (likely(sd->arch == SECCOMP_ARCH_NATIVE))
+		return seccomp_cache_check_allow_bitmap(cache->allow_native,
+							SECCOMP_ARCH_NATIVE_NR,
+							syscall_nr);
+	if (likely(sd->arch == SECCOMP_ARCH_COMPAT))
+		return seccomp_cache_check_allow_bitmap(cache->allow_compat,
+							SECCOMP_ARCH_COMPAT_NR,
+							syscall_nr);
+#endif /* SECCOMP_ARCH_COMPAT */
+
+	WARN_ON_ONCE(true);
+	return false;
+}
+#endif /* SECCOMP_ARCH_NATIVE */
+
 /**
  * seccomp_run_filters - evaluates all seccomp filters against @sd
  * @sd: optional seccomp data to be passed to filters
@@ -320,6 +394,9 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd,
 	if (WARN_ON(f == NULL))
 		return SECCOMP_RET_KILL_PROCESS;

+	if (seccomp_cache_check_allow(f, sd))
+		return SECCOMP_RET_ALLOW;
+
 	/*
 	 * All filters in the list are evaluated and the lowest BPF return
 	 * value always takes priority (ignoring the DATA).
--
2.28.0

^ permalink raw reply related	[flat|nested] 149+ messages in thread
* Re: [PATCH v5 seccomp 1/5] seccomp/cache: Lookup syscall allowlist bitmap for fast path 2020-10-11 15:47 ` [PATCH v5 seccomp 1/5] seccomp/cache: Lookup syscall allowlist bitmap for fast path YiFei Zhu @ 2020-10-12 6:42 ` Jann Horn 0 siblings, 0 replies; 149+ messages in thread From: Jann Horn @ 2020-10-12 6:42 UTC (permalink / raw) To: YiFei Zhu Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Sun, Oct 11, 2020 at 5:48 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > The overhead of running Seccomp filters has been part of some past > discussions [1][2][3]. Oftentimes, the filters have a large number > of instructions that check syscall numbers one by one and jump based > on that. Some users chain BPF filters which further enlarge the > overhead. A recent work [6] comprehensively measures the Seccomp > overhead and shows that the overhead is non-negligible and has a > non-trivial impact on application performance. > > We observed some common filters, such as docker's [4] or > systemd's [5], will make most decisions based only on the syscall > numbers, and as past discussions considered, a bitmap where each bit > represents a syscall makes most sense for these filters. > > The fast (common) path for seccomp should be that the filter permits > the syscall to pass through, and failing seccomp is expected to be > an exceptional case; it is not expected for userspace to call a > denylisted syscall over and over. > > When it can be concluded that an allow must occur for the given > architecture and syscall pair (this determination is introduced in > the next commit), seccomp will immediately allow the syscall, > bypassing further BPF execution. > > Each architecture number has its own bitmap. 
The architecture > number in seccomp_data is checked against the defined architecture > number constant before proceeding to test the bit against the > bitmap with the syscall number as the index of the bit in the > bitmap, and if the bit is set, seccomp returns allow. The bitmaps > are all clear in this patch and will be initialized in the next > commit. > > When only one architecture exists, the check against architecture > number is skipped, suggested by Kees Cook [7]. > > [1] https://lore.kernel.org/linux-security-module/c22a6c3cefc2412cad00ae14c1371711@huawei.com/T/ > [2] https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/T/ > [3] https://github.com/seccomp/libseccomp/issues/116 > [4] https://github.com/moby/moby/blob/ae0ef82b90356ac613f329a8ef5ee42ca923417d/profiles/seccomp/default.json > [5] https://github.com/systemd/systemd/blob/6743a1caf4037f03dc51a1277855018e4ab61957/src/shared/seccomp-util.c#L270 > [6] Draco: Architectural and Operating System Support for System Call Security > https://tianyin.github.io/pub/draco.pdf, MICRO-53, Oct. 2020 > [7] https://lore.kernel.org/bpf/202010091614.8BB0EB64@keescook/ > > Co-developed-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu> > Signed-off-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu> > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> Reviewed-by: Jann Horn <jannh@google.com> ^ permalink raw reply [flat|nested] 149+ messages in thread
* [PATCH v5 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow
  2020-10-11 15:47 ` [PATCH v5 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu
  2020-10-11 15:47 ` [PATCH v5 seccomp 1/5] seccomp/cache: Lookup syscall allowlist bitmap for fast path YiFei Zhu
@ 2020-10-11 15:47 ` YiFei Zhu
  2020-10-12  6:46   ` Jann Horn
  2020-10-11 15:47 ` [PATCH v5 seccomp 3/5] x86: Enable seccomp architecture tracking YiFei Zhu
  ` (3 subsequent siblings)
  5 siblings, 1 reply; 149+ messages in thread
From: YiFei Zhu @ 2020-10-11 15:47 UTC (permalink / raw)
To: containers
Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli,
	Andy Lutomirski, David Laight, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn,
	Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

From: YiFei Zhu <yifeifz2@illinois.edu>

SECCOMP_CACHE will only operate on syscalls that do not access
any syscall arguments or instruction pointer. To facilitate
this we need a static analyser to know whether a filter will
return allow regardless of syscall arguments for a given
architecture number / syscall number pair. This is implemented
here with a pseudo-emulator, and stored in a per-filter bitmap.

In order to build this bitmap at filter attach time, each filter is
emulated for every syscall (under each possible architecture), and
checked for any accesses of struct seccomp_data that are not the "arch"
nor "nr" (syscall) members. If only "arch" and "nr" are examined, and
the program returns allow, then we can be sure that the filter must
return allow independent from syscall arguments.

Nearly all seccomp filters are built from these cBPF instructions:

BPF_LD | BPF_W | BPF_ABS
BPF_JMP | BPF_JEQ | BPF_K
BPF_JMP | BPF_JGE | BPF_K
BPF_JMP | BPF_JGT | BPF_K
BPF_JMP | BPF_JSET | BPF_K
BPF_JMP | BPF_JA
BPF_RET | BPF_K
BPF_ALU | BPF_AND | BPF_K

Each of these instructions is emulated.
Any weirdness or loading
from a syscall argument will cause the emulator to bail.

The emulation is also halted if it reaches a return. In that case,
if it returns a SECCOMP_RET_ALLOW, the syscall is marked as good.

Emulator structure and comments are from Kees [1] and Jann [2].

Emulation is done at attach time. If a filter depends on more
filters, and if the dependee does not guarantee to allow the
syscall, then we skip the emulation of this syscall.

[1] https://lore.kernel.org/lkml/20200923232923.3142503-5-keescook@chromium.org/
[2] https://lore.kernel.org/lkml/CAG48ez1p=dR_2ikKq=xVxkoGg0fYpTBpkhJSv1w-6BG=76PAvw@mail.gmail.com/

Suggested-by: Jann Horn <jannh@google.com>
Co-developed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
---
 kernel/seccomp.c | 156 ++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 155 insertions(+), 1 deletion(-)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index d67a8b61f2bf..236e7b367d4e 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -169,6 +169,10 @@ static inline bool seccomp_cache_check_allow(const struct seccomp_filter *sfilte
 {
 	return false;
 }
+
+static inline void seccomp_cache_prepare(struct seccomp_filter *sfilter)
+{
+}
 #endif /* SECCOMP_ARCH_NATIVE */

 /**
@@ -187,6 +191,7 @@ static inline bool seccomp_cache_check_allow(const struct seccomp_filter *sfilte
  *	    this filter after reaching 0. The @users count is always smaller
  *	    or equal to @refs. Hence, reaching 0 for @users does not mean
  *	    the filter can be freed.
+ * @cache: cache of arch/syscall mappings to actions
  * @log: true if all actions except for SECCOMP_RET_ALLOW should be logged
  * @prev: points to a previously installed, or inherited, filter
  * @prog: the BPF program to evaluate
@@ -208,6 +213,7 @@ struct seccomp_filter {
 	refcount_t refs;
 	refcount_t users;
 	bool log;
+	struct action_cache cache;
 	struct seccomp_filter *prev;
 	struct bpf_prog *prog;
 	struct notification *notif;
@@ -621,7 +627,12 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
 {
 	struct seccomp_filter *sfilter;
 	int ret;
-	const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE);
+	const bool save_orig =
+#if defined(CONFIG_CHECKPOINT_RESTORE) || defined(SECCOMP_ARCH_NATIVE)
+		true;
+#else
+		false;
+#endif

 	if (fprog->len == 0 || fprog->len > BPF_MAXINSNS)
 		return ERR_PTR(-EINVAL);
@@ -687,6 +698,148 @@ seccomp_prepare_user_filter(const char __user *user_filter)
 	return filter;
 }

+#ifdef SECCOMP_ARCH_NATIVE
+/**
+ * seccomp_is_const_allow - check if filter is constant allow with given data
+ * @fprog: The BPF programs
+ * @sd: The seccomp data to check against, only syscall number and arch
+ *	number are considered constant.
+ */
+static bool seccomp_is_const_allow(struct sock_fprog_kern *fprog,
+				   struct seccomp_data *sd)
+{
+	unsigned int reg_value = 0;
+	unsigned int pc;
+	bool op_res;
+
+	if (WARN_ON_ONCE(!fprog))
+		return false;
+
+	for (pc = 0; pc < fprog->len; pc++) {
+		struct sock_filter *insn = &fprog->filter[pc];
+		u16 code = insn->code;
+		u32 k = insn->k;
+
+		switch (code) {
+		case BPF_LD | BPF_W | BPF_ABS:
+			switch (k) {
+			case offsetof(struct seccomp_data, nr):
+				reg_value = sd->nr;
+				break;
+			case offsetof(struct seccomp_data, arch):
+				reg_value = sd->arch;
+				break;
+			default:
+				/* can't optimize (non-constant value load) */
+				return false;
+			}
+			break;
+		case BPF_RET | BPF_K:
+			/* reached return with constant values only, check allow */
+			return k == SECCOMP_RET_ALLOW;
+		case BPF_JMP | BPF_JA:
+			pc += insn->k;
+			break;
+		case BPF_JMP | BPF_JEQ | BPF_K:
+		case BPF_JMP | BPF_JGE | BPF_K:
+		case BPF_JMP | BPF_JGT | BPF_K:
+		case BPF_JMP | BPF_JSET | BPF_K:
+			switch (BPF_OP(code)) {
+			case BPF_JEQ:
+				op_res = reg_value == k;
+				break;
+			case BPF_JGE:
+				op_res = reg_value >= k;
+				break;
+			case BPF_JGT:
+				op_res = reg_value > k;
+				break;
+			case BPF_JSET:
+				op_res = !!(reg_value & k);
+				break;
+			default:
+				/* can't optimize (unknown jump) */
+				return false;
+			}
+
+			pc += op_res ? insn->jt : insn->jf;
+			break;
+		case BPF_ALU | BPF_AND | BPF_K:
+			reg_value &= k;
+			break;
+		default:
+			/* can't optimize (unknown insn) */
+			return false;
+		}
+	}
+
+	/* ran off the end of the filter?! */
+	WARN_ON(1);
+	return false;
+}
+
+static void seccomp_cache_prepare_bitmap(struct seccomp_filter *sfilter,
+					 void *bitmap, const void *bitmap_prev,
+					 size_t bitmap_size, int arch)
+{
+	struct sock_fprog_kern *fprog = sfilter->prog->orig_prog;
+	struct seccomp_data sd;
+	int nr;
+
+	if (bitmap_prev) {
+		/* The new filter must be as restrictive as the last. */
+		bitmap_copy(bitmap, bitmap_prev, bitmap_size);
+	} else {
+		/* Before any filters, all syscalls are always allowed. */
+		bitmap_fill(bitmap, bitmap_size);
+	}
+
+	for (nr = 0; nr < bitmap_size; nr++) {
+		/* No bitmap change: not a cacheable action. */
+		if (!test_bit(nr, bitmap))
+			continue;
+
+		sd.nr = nr;
+		sd.arch = arch;
+
+		/* No bitmap change: continue to always allow. */
+		if (seccomp_is_const_allow(fprog, &sd))
+			continue;
+
+		/*
+		 * Not a cacheable action: always run filters.
+		 * atomic clear_bit() not needed, filter not visible yet.
+		 */
+		__clear_bit(nr, bitmap);
+	}
+}
+
+/**
+ * seccomp_cache_prepare - emulate the filter to find cachable syscalls
+ * @sfilter: The seccomp filter
+ *
+ * Returns 0 if successful or -errno if error occurred.
+ */
+static void seccomp_cache_prepare(struct seccomp_filter *sfilter)
+{
+	struct action_cache *cache = &sfilter->cache;
+	const struct action_cache *cache_prev =
+		sfilter->prev ? &sfilter->prev->cache : NULL;
+
+	seccomp_cache_prepare_bitmap(sfilter, cache->allow_native,
+				     cache_prev ? cache_prev->allow_native : NULL,
+				     SECCOMP_ARCH_NATIVE_NR,
+				     SECCOMP_ARCH_NATIVE);
+
+#ifdef SECCOMP_ARCH_COMPAT
+	seccomp_cache_prepare_bitmap(sfilter, cache->allow_compat,
+				     cache_prev ? cache_prev->allow_compat : NULL,
+				     SECCOMP_ARCH_COMPAT_NR,
+				     SECCOMP_ARCH_COMPAT);
+#endif /* SECCOMP_ARCH_COMPAT */
+}
+#endif /* SECCOMP_ARCH_NATIVE */
+
 /**
  * seccomp_attach_filter: validate and attach filter
  * @flags:  flags to change filter behavior
@@ -736,6 +889,7 @@ static long seccomp_attach_filter(unsigned int flags,
 	 * task reference.
 	 */
 	filter->prev = current->seccomp.filter;
+	seccomp_cache_prepare(filter);
 	current->seccomp.filter = filter;
 	atomic_inc(&current->seccomp.filter_count);

--
2.28.0

^ permalink raw reply related	[flat|nested] 149+ messages in thread
* Re: [PATCH v5 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow 2020-10-11 15:47 ` [PATCH v5 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow YiFei Zhu @ 2020-10-12 6:46 ` Jann Horn 0 siblings, 0 replies; 149+ messages in thread From: Jann Horn @ 2020-10-12 6:46 UTC (permalink / raw) To: YiFei Zhu Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Sun, Oct 11, 2020 at 5:48 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > SECCOMP_CACHE will only operate on syscalls that do not access > any syscall arguments or instruction pointer. To facilitate > this we need a static analyser to know whether a filter will > return allow regardless of syscall arguments for a given > architecture number / syscall number pair. This is implemented > here with a pseudo-emulator, and stored in a per-filter bitmap. > > In order to build this bitmap at filter attach time, each filter is > emulated for every syscall (under each possible architecture), and > checked for any accesses of struct seccomp_data that are not the "arch" > nor "nr" (syscall) members. If only "arch" and "nr" are examined, and > the program returns allow, then we can be sure that the filter must > return allow independent from syscall arguments. > > Nearly all seccomp filters are built from these cBPF instructions: > > BPF_LD | BPF_W | BPF_ABS > BPF_JMP | BPF_JEQ | BPF_K > BPF_JMP | BPF_JGE | BPF_K > BPF_JMP | BPF_JGT | BPF_K > BPF_JMP | BPF_JSET | BPF_K > BPF_JMP | BPF_JA > BPF_RET | BPF_K > BPF_ALU | BPF_AND | BPF_K > > Each of these instructions are emulated. Any weirdness or loading > from a syscall argument will cause the emulator to bail. > > The emulation is also halted if it reaches a return. 
In that case, > if it returns an SECCOMP_RET_ALLOW, the syscall is marked as good. > > Emulator structure and comments are from Kees [1] and Jann [2]. > > Emulation is done at attach time. If a filter depends on more > filters, and if the dependee does not guarantee to allow the > syscall, then we skip the emulation of this syscall. > > [1] https://lore.kernel.org/lkml/20200923232923.3142503-5-keescook@chromium.org/ > [2] https://lore.kernel.org/lkml/CAG48ez1p=dR_2ikKq=xVxkoGg0fYpTBpkhJSv1w-6BG=76PAvw@mail.gmail.com/ > > Suggested-by: Jann Horn <jannh@google.com> > Co-developed-by: Kees Cook <keescook@chromium.org> > Signed-off-by: Kees Cook <keescook@chromium.org> > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> Reviewed-by: Jann Horn <jannh@google.com> ^ permalink raw reply [flat|nested] 149+ messages in thread
* [PATCH v5 seccomp 3/5] x86: Enable seccomp architecture tracking
  2020-10-11 15:47 ` [PATCH v5 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu
  2020-10-11 15:47 ` [PATCH v5 seccomp 1/5] seccomp/cache: Lookup syscall allowlist bitmap for fast path YiFei Zhu
  2020-10-11 15:47 ` [PATCH v5 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow YiFei Zhu
@ 2020-10-11 15:47 ` YiFei Zhu
  2020-10-11 15:47 ` [PATCH v5 seccomp 4/5] selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu
  ` (2 subsequent siblings)
  5 siblings, 0 replies; 149+ messages in thread
From: YiFei Zhu @ 2020-10-11 15:47 UTC (permalink / raw)
To: containers
Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli,
	Andy Lutomirski, David Laight, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn,
	Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

From: Kees Cook <keescook@chromium.org>

Provide seccomp internals with the details to calculate which syscall
table the running kernel is expecting to deal with. This allows for
efficient architecture pinning and paves the way for constant-action
bitmaps.
Signed-off-by: Kees Cook <keescook@chromium.org>
Co-developed-by: YiFei Zhu <yifeifz2@illinois.edu>
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
---
 arch/x86/include/asm/seccomp.h | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/arch/x86/include/asm/seccomp.h b/arch/x86/include/asm/seccomp.h
index 2bd1338de236..b17d037c72ce 100644
--- a/arch/x86/include/asm/seccomp.h
+++ b/arch/x86/include/asm/seccomp.h
@@ -16,6 +16,23 @@
 #define __NR_seccomp_sigreturn_32 __NR_ia32_sigreturn
 #endif

+#ifdef CONFIG_X86_64
+# define SECCOMP_ARCH_NATIVE		AUDIT_ARCH_X86_64
+# define SECCOMP_ARCH_NATIVE_NR		NR_syscalls
+# ifdef CONFIG_COMPAT
+# define SECCOMP_ARCH_COMPAT		AUDIT_ARCH_I386
+# define SECCOMP_ARCH_COMPAT_NR		IA32_NR_syscalls
+# endif
+/*
+ * x32 will have __X32_SYSCALL_BIT set in syscall number. We don't support
+ * caching them and they are treated as out of range syscalls, which will
+ * always pass through the BPF filter.
+ */
+#else /* !CONFIG_X86_64 */
+# define SECCOMP_ARCH_NATIVE		AUDIT_ARCH_I386
+# define SECCOMP_ARCH_NATIVE_NR		NR_syscalls
+#endif
+
 #include <asm-generic/seccomp.h>

 #endif /* _ASM_X86_SECCOMP_H */
--
2.28.0

^ permalink raw reply related	[flat|nested] 149+ messages in thread
* [PATCH v5 seccomp 4/5] selftests/seccomp: Compare bitmap vs filter overhead
  2020-10-11 15:47 ` [PATCH v5 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu
  ` (2 preceding siblings ...)
  2020-10-11 15:47 ` [PATCH v5 seccomp 3/5] x86: Enable seccomp architecture tracking YiFei Zhu
@ 2020-10-11 15:47 ` YiFei Zhu
  2020-10-11 15:47 ` [PATCH v5 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu
  2020-10-27 19:14 ` [PATCH v5 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results Kees Cook
  5 siblings, 0 replies; 149+ messages in thread
From: YiFei Zhu @ 2020-10-11 15:47 UTC (permalink / raw)
To: containers
Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli,
	Andy Lutomirski, David Laight, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn,
	Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

From: Kees Cook <keescook@chromium.org>

As part of the seccomp benchmarking, include the expectations with
regard to the timing behavior of the constant action bitmaps, and
report inconsistencies better.

Example output with constant action bitmaps on x86:

$ sudo ./seccomp_benchmark 100000000
Current BPF sysctl settings:
net.core.bpf_jit_enable = 1
net.core.bpf_jit_harden = 0
Benchmarking 200000000 syscalls...
129.359381409 - 0.008724424 = 129350656985 (129.4s)
getpid native: 646 ns
264.385890006 - 129.360453229 = 135025436777 (135.0s)
getpid RET_ALLOW 1 filter (bitmap): 675 ns
399.400511893 - 264.387045901 = 135013465992 (135.0s)
getpid RET_ALLOW 2 filters (bitmap): 675 ns
545.872866260 - 399.401718327 = 146471147933 (146.5s)
getpid RET_ALLOW 3 filters (full): 732 ns
696.337101319 - 545.874097681 = 150463003638 (150.5s)
getpid RET_ALLOW 4 filters (full): 752 ns
Estimated total seccomp overhead for 1 bitmapped filter: 29 ns
Estimated total seccomp overhead for 2 bitmapped filters: 29 ns
Estimated total seccomp overhead for 3 full filters: 86 ns
Estimated total seccomp overhead for 4 full filters: 106 ns
Estimated seccomp entry overhead: 29 ns
Estimated seccomp per-filter overhead (last 2 diff): 20 ns
Estimated seccomp per-filter overhead (filters / 4): 19 ns
Expectations:
	native ≤ 1 bitmap (646 ≤ 675): ✔️
	native ≤ 1 filter (646 ≤ 732): ✔️
	per-filter (last 2 diff) ≈ per-filter (filters / 4) (20 ≈ 19): ✔️
	1 bitmapped ≈ 2 bitmapped (29 ≈ 29): ✔️
	entry ≈ 1 bitmapped (29 ≈ 29): ✔️
	entry ≈ 2 bitmapped (29 ≈ 29): ✔️
	native + entry + (per filter * 4) ≈ 4 filters total (755 ≈ 752): ✔️

Signed-off-by: Kees Cook <keescook@chromium.org>
[YiFei: Changed commit message to show stats for this patch series]
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
---
 .../selftests/seccomp/seccomp_benchmark.c     | 151 +++++++++++++++---
 tools/testing/selftests/seccomp/settings      |   2 +-
 2 files changed, 130 insertions(+), 23 deletions(-)

diff --git a/tools/testing/selftests/seccomp/seccomp_benchmark.c b/tools/testing/selftests/seccomp/seccomp_benchmark.c
index 91f5a89cadac..fcc806585266 100644
--- a/tools/testing/selftests/seccomp/seccomp_benchmark.c
+++ b/tools/testing/selftests/seccomp/seccomp_benchmark.c
@@ -4,12 +4,16 @@
  */
 #define _GNU_SOURCE
 #include <assert.h>
+#include <limits.h>
+#include <stdbool.h>
+#include <stddef.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <time.h>
 #include
<unistd.h> #include <linux/filter.h> #include <linux/seccomp.h> +#include <sys/param.h> #include <sys/prctl.h> #include <sys/syscall.h> #include <sys/types.h> @@ -70,18 +74,74 @@ unsigned long long calibrate(void) return samples * seconds; } +bool approx(int i_one, int i_two) +{ + double one = i_one, one_bump = one * 0.01; + double two = i_two, two_bump = two * 0.01; + + one_bump = one + MAX(one_bump, 2.0); + two_bump = two + MAX(two_bump, 2.0); + + /* Equal to, or within 1% or 2 digits */ + if (one == two || + (one > two && one <= two_bump) || + (two > one && two <= one_bump)) + return true; + return false; +} + +bool le(int i_one, int i_two) +{ + if (i_one <= i_two) + return true; + return false; +} + +long compare(const char *name_one, const char *name_eval, const char *name_two, + unsigned long long one, bool (*eval)(int, int), unsigned long long two) +{ + bool good; + + printf("\t%s %s %s (%lld %s %lld): ", name_one, name_eval, name_two, + (long long)one, name_eval, (long long)two); + if (one > INT_MAX) { + printf("Miscalculation! Measurement went negative: %lld\n", (long long)one); + return 1; + } + if (two > INT_MAX) { + printf("Miscalculation! Measurement went negative: %lld\n", (long long)two); + return 1; + } + + good = eval(one, two); + printf("%s\n", good ? "✔️" : "❌"); + + return good ? 
0 : 1; +} + int main(int argc, char *argv[]) { + struct sock_filter bitmap_filter[] = { + BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, nr)), + BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW), + }; + struct sock_fprog bitmap_prog = { + .len = (unsigned short)ARRAY_SIZE(bitmap_filter), + .filter = bitmap_filter, + }; struct sock_filter filter[] = { + BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, args[0])), BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW), }; struct sock_fprog prog = { .len = (unsigned short)ARRAY_SIZE(filter), .filter = filter, }; - long ret; - unsigned long long samples; - unsigned long long native, filter1, filter2; + + long ret, bits; + unsigned long long samples, calc; + unsigned long long native, filter1, filter2, bitmap1, bitmap2; + unsigned long long entry, per_filter1, per_filter2; printf("Current BPF sysctl settings:\n"); system("sysctl net.core.bpf_jit_enable"); @@ -101,35 +161,82 @@ int main(int argc, char *argv[]) ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0); assert(ret == 0); - /* One filter */ - ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog); + /* One filter resulting in a bitmap */ + ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bitmap_prog); assert(ret == 0); - filter1 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples; - printf("getpid RET_ALLOW 1 filter: %llu ns\n", filter1); + bitmap1 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples; + printf("getpid RET_ALLOW 1 filter (bitmap): %llu ns\n", bitmap1); + + /* Second filter resulting in a bitmap */ + ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bitmap_prog); + assert(ret == 0); - if (filter1 == native) - printf("No overhead measured!? 
Try running again with more samples.\n"); + bitmap2 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples; + printf("getpid RET_ALLOW 2 filters (bitmap): %llu ns\n", bitmap2); - /* Two filters */ + /* Third filter, can no longer be converted to bitmap */ ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog); assert(ret == 0); - filter2 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples; - printf("getpid RET_ALLOW 2 filters: %llu ns\n", filter2); - - /* Calculations */ - printf("Estimated total seccomp overhead for 1 filter: %llu ns\n", - filter1 - native); + filter1 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples; + printf("getpid RET_ALLOW 3 filters (full): %llu ns\n", filter1); - printf("Estimated total seccomp overhead for 2 filters: %llu ns\n", - filter2 - native); + /* Fourth filter, can not be converted to bitmap because of filter 3 */ + ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bitmap_prog); + assert(ret == 0); - printf("Estimated seccomp per-filter overhead: %llu ns\n", - filter2 - filter1); + filter2 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples; + printf("getpid RET_ALLOW 4 filters (full): %llu ns\n", filter2); + + /* Estimations */ +#define ESTIMATE(fmt, var, what) do { \ + var = (what); \ + printf("Estimated " fmt ": %llu ns\n", var); \ + if (var > INT_MAX) \ + goto more_samples; \ + } while (0) + + ESTIMATE("total seccomp overhead for 1 bitmapped filter", calc, + bitmap1 - native); + ESTIMATE("total seccomp overhead for 2 bitmapped filters", calc, + bitmap2 - native); + ESTIMATE("total seccomp overhead for 3 full filters", calc, + filter1 - native); + ESTIMATE("total seccomp overhead for 4 full filters", calc, + filter2 - native); + ESTIMATE("seccomp entry overhead", entry, + bitmap1 - native - (bitmap2 - bitmap1)); + ESTIMATE("seccomp per-filter overhead (last 2 diff)", per_filter1, + filter2 - filter1); + ESTIMATE("seccomp per-filter overhead (filters / 4)", per_filter2, + (filter2 - native - entry) / 4); + + 
printf("Expectations:\n"); + ret |= compare("native", "≤", "1 bitmap", native, le, bitmap1); + bits = compare("native", "≤", "1 filter", native, le, filter1); + if (bits) + goto more_samples; + + ret |= compare("per-filter (last 2 diff)", "≈", "per-filter (filters / 4)", + per_filter1, approx, per_filter2); + + bits = compare("1 bitmapped", "≈", "2 bitmapped", + bitmap1 - native, approx, bitmap2 - native); + if (bits) { + printf("Skipping constant action bitmap expectations: they appear unsupported.\n"); + goto out; + } - printf("Estimated seccomp entry overhead: %llu ns\n", - filter1 - native - (filter2 - filter1)); + ret |= compare("entry", "≈", "1 bitmapped", entry, approx, bitmap1 - native); + ret |= compare("entry", "≈", "2 bitmapped", entry, approx, bitmap2 - native); + ret |= compare("native + entry + (per filter * 4)", "≈", "4 filters total", + entry + (per_filter1 * 4) + native, approx, filter2); + if (ret == 0) + goto out; +more_samples: + printf("Saw unexpected benchmark result. Try running again with more samples?\n"); +out: return 0; } diff --git a/tools/testing/selftests/seccomp/settings b/tools/testing/selftests/seccomp/settings index ba4d85f74cd6..6091b45d226b 100644 --- a/tools/testing/selftests/seccomp/settings +++ b/tools/testing/selftests/seccomp/settings @@ -1 +1 @@ -timeout=90 +timeout=120 -- 2.28.0 ^ permalink raw reply related [flat|nested] 149+ messages in thread
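The selftest above distinguishes "bitmap" filters, which inspect only seccomp_data->nr, from filters that read args[0] and therefore can never be bitmap-cached. The kernel makes that determination with the BPF "emulator" added earlier in the series; as a much-simplified userspace sketch of the underlying criterion, one can conservatively reject any filter containing an absolute load that reaches past the arch field into instruction_pointer or args[]. Note this is only an approximation of the idea for illustration (the in-kernel emulator actually emulates the program per syscall number), and the function name below is mine, not the kernel's:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/*
 * Conservative approximation of the cacheability check: a filter can
 * only be bitmap-cached if its result cannot depend on the syscall
 * arguments or the instruction pointer.  The in-kernel "emulator" is
 * smarter (it runs the program once per syscall number), but any
 * filter that passes this scan is certainly argument-independent.
 */
static bool filter_may_be_cacheable(const struct sock_filter *insns,
				    size_t count)
{
	size_t i;

	for (i = 0; i < count; i++) {
		__u16 code = insns[i].code;

		if (BPF_CLASS(code) != BPF_LD && BPF_CLASS(code) != BPF_LDX)
			continue;
		if (BPF_MODE(code) != BPF_ABS)
			continue;
		/*
		 * Loads of ->nr and ->arch are fine; anything at or past
		 * ->instruction_pointer (which includes args[]) makes the
		 * result argument-dependent.
		 */
		if (insns[i].k >= offsetof(struct seccomp_data, instruction_pointer))
			return false;
	}
	return true;
}
```

Fed the two programs from the selftest, this scan accepts bitmap_filter (loads only ->nr) and rejects the args[0]-reading filter, matching which of them the kernel can bitmap.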
* [PATCH v5 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-10-11 15:47 ` [PATCH v5 seccomp 0/5]seccomp: Add bitmap cache of constant allow filter results YiFei Zhu ` (3 preceding siblings ...) 2020-10-11 15:47 ` [PATCH v5 seccomp 4/5] selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu @ 2020-10-11 15:47 ` YiFei Zhu 2020-10-12 6:49 ` Jann Horn 2020-12-17 12:14 ` Geert Uytterhoeven 2020-10-27 19:14 ` [PATCH v5 seccomp 0/5]seccomp: Add bitmap cache of constant allow filter results Kees Cook 5 siblings, 2 replies; 149+ messages in thread From: YiFei Zhu @ 2020-10-11 15:47 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: YiFei Zhu <yifeifz2@illinois.edu> Currently the kernel does not provide an infrastructure to translate architecture numbers to a human-readable name. Translating syscall numbers to syscall names is possible through FTRACE_SYSCALL infrastructure but it does not provide support for compat syscalls. This will create a file for each PID as /proc/pid/seccomp_cache. The file will be empty when no seccomp filters are loaded, or be in the format of: <arch name> <decimal syscall number> <ALLOW | FILTER> where ALLOW means the cache is guaranteed to allow the syscall, and filter means the cache will pass the syscall to the BPF filter. For the docker default profile on x86_64 it looks like: x86_64 0 ALLOW x86_64 1 ALLOW x86_64 2 ALLOW x86_64 3 ALLOW [...] x86_64 132 ALLOW x86_64 133 ALLOW x86_64 134 FILTER x86_64 135 FILTER x86_64 136 FILTER x86_64 137 ALLOW x86_64 138 ALLOW x86_64 139 FILTER x86_64 140 ALLOW x86_64 141 ALLOW [...] 
This file is guarded by CONFIG_SECCOMP_CACHE_DEBUG with a default of N because I think certain users of seccomp might not want the application to know which syscalls are definitely usable. For the same reason, it is also guarded by CAP_SYS_ADMIN. Suggested-by: Jann Horn <jannh@google.com> Link: https://lore.kernel.org/lkml/CAG48ez3Ofqp4crXGksLmZY6=fGrF_tWyUCg7PBkAetvbbOPeOA@mail.gmail.com/ Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> --- arch/Kconfig | 24 ++++++++++++++ arch/x86/Kconfig | 1 + arch/x86/include/asm/seccomp.h | 3 ++ fs/proc/base.c | 6 ++++ include/linux/seccomp.h | 7 ++++ kernel/seccomp.c | 59 ++++++++++++++++++++++++++++++++++ 6 files changed, 100 insertions(+) diff --git a/arch/Kconfig b/arch/Kconfig index 21a3675a7a3a..6157c3ce0662 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -471,6 +471,15 @@ config HAVE_ARCH_SECCOMP_FILTER results in the system call being skipped immediately. - seccomp syscall wired up +config HAVE_ARCH_SECCOMP_CACHE + bool + help + An arch should select this symbol if it provides all of these things: + - all the requirements for HAVE_ARCH_SECCOMP_FILTER + - SECCOMP_ARCH_NATIVE + - SECCOMP_ARCH_NATIVE_NR + - SECCOMP_ARCH_NATIVE_NAME + config SECCOMP prompt "Enable seccomp to safely execute untrusted bytecode" def_bool y @@ -498,6 +507,21 @@ config SECCOMP_FILTER See Documentation/userspace-api/seccomp_filter.rst for details. +config SECCOMP_CACHE_DEBUG + bool "Show seccomp filter cache status in /proc/pid/seccomp_cache" + depends on SECCOMP + depends on SECCOMP_FILTER && HAVE_ARCH_SECCOMP_CACHE + depends on PROC_FS + help + This enables the /proc/pid/seccomp_cache interface to monitor + seccomp cache data. The file format is subject to change. Reading + the file requires CAP_SYS_ADMIN. + + This option is for debugging only. Enabling presents the risk that + an adversary may be able to infer the seccomp filter logic. + + If unsure, say N. 
+ config HAVE_ARCH_STACKLEAK bool help diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 1ab22869a765..1a807f89ac77 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -150,6 +150,7 @@ config X86 select HAVE_ARCH_COMPAT_MMAP_BASES if MMU && COMPAT select HAVE_ARCH_PREL32_RELOCATIONS select HAVE_ARCH_SECCOMP_FILTER + select HAVE_ARCH_SECCOMP_CACHE select HAVE_ARCH_THREAD_STRUCT_WHITELIST select HAVE_ARCH_STACKLEAK select HAVE_ARCH_TRACEHOOK diff --git a/arch/x86/include/asm/seccomp.h b/arch/x86/include/asm/seccomp.h index b17d037c72ce..fef16e398161 100644 --- a/arch/x86/include/asm/seccomp.h +++ b/arch/x86/include/asm/seccomp.h @@ -19,9 +19,11 @@ #ifdef CONFIG_X86_64 # define SECCOMP_ARCH_NATIVE AUDIT_ARCH_X86_64 # define SECCOMP_ARCH_NATIVE_NR NR_syscalls +# define SECCOMP_ARCH_NATIVE_NAME "x86_64" # ifdef CONFIG_COMPAT # define SECCOMP_ARCH_COMPAT AUDIT_ARCH_I386 # define SECCOMP_ARCH_COMPAT_NR IA32_NR_syscalls +# define SECCOMP_ARCH_COMPAT_NAME "ia32" # endif /* * x32 will have __X32_SYSCALL_BIT set in syscall number. 
We don't support @@ -31,6 +33,7 @@ #else /* !CONFIG_X86_64 */ # define SECCOMP_ARCH_NATIVE AUDIT_ARCH_I386 # define SECCOMP_ARCH_NATIVE_NR NR_syscalls +# define SECCOMP_ARCH_NATIVE_NAME "ia32" #endif #include <asm-generic/seccomp.h> diff --git a/fs/proc/base.c b/fs/proc/base.c index 617db4e0faa0..a4990410ff05 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -3258,6 +3258,9 @@ static const struct pid_entry tgid_base_stuff[] = { #ifdef CONFIG_PROC_PID_ARCH_STATUS ONE("arch_status", S_IRUGO, proc_pid_arch_status), #endif +#ifdef CONFIG_SECCOMP_CACHE_DEBUG + ONE("seccomp_cache", S_IRUSR, proc_pid_seccomp_cache), +#endif }; static int proc_tgid_base_readdir(struct file *file, struct dir_context *ctx) @@ -3587,6 +3590,9 @@ static const struct pid_entry tid_base_stuff[] = { #ifdef CONFIG_PROC_PID_ARCH_STATUS ONE("arch_status", S_IRUGO, proc_pid_arch_status), #endif +#ifdef CONFIG_SECCOMP_CACHE_DEBUG + ONE("seccomp_cache", S_IRUSR, proc_pid_seccomp_cache), +#endif }; static int proc_tid_base_readdir(struct file *file, struct dir_context *ctx) diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h index 02aef2844c38..76963ec4641a 100644 --- a/include/linux/seccomp.h +++ b/include/linux/seccomp.h @@ -121,4 +121,11 @@ static inline long seccomp_get_metadata(struct task_struct *task, return -EINVAL; } #endif /* CONFIG_SECCOMP_FILTER && CONFIG_CHECKPOINT_RESTORE */ + +#ifdef CONFIG_SECCOMP_CACHE_DEBUG +struct seq_file; + +int proc_pid_seccomp_cache(struct seq_file *m, struct pid_namespace *ns, + struct pid *pid, struct task_struct *task); +#endif #endif /* _LINUX_SECCOMP_H */ diff --git a/kernel/seccomp.c b/kernel/seccomp.c index 236e7b367d4e..1df2fac281da 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -553,6 +553,9 @@ void seccomp_filter_release(struct task_struct *tsk) { struct seccomp_filter *orig = tsk->seccomp.filter; + /* We are effectively holding the siglock by not having any sighand. 
*/ + WARN_ON(tsk->sighand != NULL); + /* Detach task from its filter tree. */ tsk->seccomp.filter = NULL; __seccomp_filter_release(orig); @@ -2311,3 +2314,59 @@ static int __init seccomp_sysctl_init(void) device_initcall(seccomp_sysctl_init) #endif /* CONFIG_SYSCTL */ + +#ifdef CONFIG_SECCOMP_CACHE_DEBUG +/* Currently CONFIG_SECCOMP_CACHE_DEBUG implies SECCOMP_ARCH_NATIVE */ +static void proc_pid_seccomp_cache_arch(struct seq_file *m, const char *name, + const void *bitmap, size_t bitmap_size) +{ + int nr; + + for (nr = 0; nr < bitmap_size; nr++) { + bool cached = test_bit(nr, bitmap); + char *status = cached ? "ALLOW" : "FILTER"; + + seq_printf(m, "%s %d %s\n", name, nr, status); + } +} + +int proc_pid_seccomp_cache(struct seq_file *m, struct pid_namespace *ns, + struct pid *pid, struct task_struct *task) +{ + struct seccomp_filter *f; + unsigned long flags; + + /* + * We don't want some sandboxed process to know what their seccomp + * filters consist of. + */ + if (!file_ns_capable(m->file, &init_user_ns, CAP_SYS_ADMIN)) + return -EACCES; + + if (!lock_task_sighand(task, &flags)) + return -ESRCH; + + f = READ_ONCE(task->seccomp.filter); + if (!f) { + unlock_task_sighand(task, &flags); + return 0; + } + + /* prevent filter from being freed while we are printing it */ + __get_seccomp_filter(f); + unlock_task_sighand(task, &flags); + + proc_pid_seccomp_cache_arch(m, SECCOMP_ARCH_NATIVE_NAME, + f->cache.allow_native, + SECCOMP_ARCH_NATIVE_NR); + +#ifdef SECCOMP_ARCH_COMPAT + proc_pid_seccomp_cache_arch(m, SECCOMP_ARCH_COMPAT_NAME, + f->cache.allow_compat, + SECCOMP_ARCH_COMPAT_NR); +#endif /* SECCOMP_ARCH_COMPAT */ + + __put_seccomp_filter(f); + return 0; +} +#endif /* CONFIG_SECCOMP_CACHE_DEBUG */ -- 2.28.0 ^ permalink raw reply related [flat|nested] 149+ messages in thread
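The /proc/pid/seccomp_cache format the patch emits ("%s %d %s\n" per syscall, i.e. <arch name> <decimal syscall number> <ALLOW | FILTER>) is easy to consume from userspace. Below is a hedged sketch of a line parser; the function name is mine, and bear in mind that the file only exists with CONFIG_SECCOMP_CACHE_DEBUG=y, requires CAP_SYS_ADMIN to read, and the commit message explicitly says the format is subject to change:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse one line of /proc/<pid>/seccomp_cache, which the kernel
 * seq_printf()s as "%s %d %s\n" per syscall.  Returns true on
 * success; *allowed is set when the cached verdict is ALLOW
 * (FILTER means the BPF program still runs for that syscall).
 */
static bool parse_seccomp_cache_line(const char *line, char arch[32],
				     int *nr, bool *allowed)
{
	char verdict[16];

	if (sscanf(line, "%31s %d %15s", arch, nr, verdict) != 3)
		return false;
	if (strcmp(verdict, "ALLOW") == 0)
		*allowed = true;
	else if (strcmp(verdict, "FILTER") == 0)
		*allowed = false;
	else
		return false;
	return true;
}
```

A consumer would fgets() each line of the proc file and feed it to this parser, e.g. to report which syscalls of a sandboxed task still pay the full filter-execution cost.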
* Re: [PATCH v5 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-10-11 15:47 ` [PATCH v5 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu @ 2020-10-12 6:49 ` Jann Horn 2020-12-17 12:14 ` Geert Uytterhoeven 1 sibling, 0 replies; 149+ messages in thread From: Jann Horn @ 2020-10-12 6:49 UTC (permalink / raw) To: YiFei Zhu Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Sun, Oct 11, 2020 at 5:48 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > Currently the kernel does not provide an infrastructure to translate > architecture numbers to a human-readable name. Translating syscall > numbers to syscall names is possible through FTRACE_SYSCALL > infrastructure but it does not provide support for compat syscalls. > > This will create a file for each PID as /proc/pid/seccomp_cache. > The file will be empty when no seccomp filters are loaded, or be > in the format of: > <arch name> <decimal syscall number> <ALLOW | FILTER> > where ALLOW means the cache is guaranteed to allow the syscall, > and filter means the cache will pass the syscall to the BPF filter. > > For the docker default profile on x86_64 it looks like: > x86_64 0 ALLOW > x86_64 1 ALLOW > x86_64 2 ALLOW > x86_64 3 ALLOW > [...] > x86_64 132 ALLOW > x86_64 133 ALLOW > x86_64 134 FILTER > x86_64 135 FILTER > x86_64 136 FILTER > x86_64 137 ALLOW > x86_64 138 ALLOW > x86_64 139 FILTER > x86_64 140 ALLOW > x86_64 141 ALLOW > [...] > > This file is guarded by CONFIG_SECCOMP_CACHE_DEBUG with a default > of N because I think certain users of seccomp might not want the > application to know which syscalls are definitely usable. For > the same reason, it is also guarded by CAP_SYS_ADMIN. 
> > Suggested-by: Jann Horn <jannh@google.com> > Link: https://lore.kernel.org/lkml/CAG48ez3Ofqp4crXGksLmZY6=fGrF_tWyUCg7PBkAetvbbOPeOA@mail.gmail.com/ > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> Reviewed-by: Jann Horn <jannh@google.com> ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v5 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-10-11 15:47 ` [PATCH v5 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu 2020-10-12 6:49 ` Jann Horn @ 2020-12-17 12:14 ` Geert Uytterhoeven 2020-12-17 18:34 ` YiFei Zhu 1 sibling, 1 reply; 149+ messages in thread From: Geert Uytterhoeven @ 2020-12-17 12:14 UTC (permalink / raw) To: YiFei Zhu Cc: containers, YiFei Zhu, bpf, Linux Kernel Mailing List, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry Hi Yifei, On Sun, Oct 11, 2020 at 8:08 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > From: YiFei Zhu <yifeifz2@illinois.edu> > > Currently the kernel does not provide an infrastructure to translate > architecture numbers to a human-readable name. Translating syscall > numbers to syscall names is possible through FTRACE_SYSCALL > infrastructure but it does not provide support for compat syscalls. > > This will create a file for each PID as /proc/pid/seccomp_cache. > The file will be empty when no seccomp filters are loaded, or be > in the format of: > <arch name> <decimal syscall number> <ALLOW | FILTER> > where ALLOW means the cache is guaranteed to allow the syscall, > and filter means the cache will pass the syscall to the BPF filter. > > For the docker default profile on x86_64 it looks like: > x86_64 0 ALLOW > x86_64 1 ALLOW > x86_64 2 ALLOW > x86_64 3 ALLOW > [...] > x86_64 132 ALLOW > x86_64 133 ALLOW > x86_64 134 FILTER > x86_64 135 FILTER > x86_64 136 FILTER > x86_64 137 ALLOW > x86_64 138 ALLOW > x86_64 139 FILTER > x86_64 140 ALLOW > x86_64 141 ALLOW > [...] 
> > This file is guarded by CONFIG_SECCOMP_CACHE_DEBUG with a default > of N because I think certain users of seccomp might not want the > application to know which syscalls are definitely usable. For > the same reason, it is also guarded by CAP_SYS_ADMIN. > > Suggested-by: Jann Horn <jannh@google.com> > Link: https://lore.kernel.org/lkml/CAG48ez3Ofqp4crXGksLmZY6=fGrF_tWyUCg7PBkAetvbbOPeOA@mail.gmail.com/ > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> > @@ -2311,3 +2314,59 @@ static int __init seccomp_sysctl_init(void) > device_initcall(seccomp_sysctl_init) > > #endif /* CONFIG_SYSCTL */ > + > +#ifdef CONFIG_SECCOMP_CACHE_DEBUG > +/* Currently CONFIG_SECCOMP_CACHE_DEBUG implies SECCOMP_ARCH_NATIVE */ Should there be a dependency on SECCOMP_ARCH_NATIVE? Should all architectures that implement seccomp have this? E.g. mips does select HAVE_ARCH_SECCOMP_FILTER, but doesn't have SECCOMP_ARCH_NATIVE? (noticed with preliminary out-of-tree seccomp implementation for m68k, which doesn't have SECCOMP_ARCH_NATIVE > +static void proc_pid_seccomp_cache_arch(struct seq_file *m, const char *name, > + const void *bitmap, size_t bitmap_size) > +{ > + int nr; > + > + for (nr = 0; nr < bitmap_size; nr++) { > + bool cached = test_bit(nr, bitmap); > + char *status = cached ? "ALLOW" : "FILTER"; > + > + seq_printf(m, "%s %d %s\n", name, nr, status); > + } > +} > + > +int proc_pid_seccomp_cache(struct seq_file *m, struct pid_namespace *ns, > + struct pid *pid, struct task_struct *task) > +{ > + struct seccomp_filter *f; > + unsigned long flags; > + > + /* > + * We don't want some sandboxed process to know what their seccomp > + * filters consist of. 
> + */ > + if (!file_ns_capable(m->file, &init_user_ns, CAP_SYS_ADMIN)) > + return -EACCES; > + > + if (!lock_task_sighand(task, &flags)) > + return -ESRCH; > + > + f = READ_ONCE(task->seccomp.filter); > + if (!f) { > + unlock_task_sighand(task, &flags); > + return 0; > + } > + > + /* prevent filter from being freed while we are printing it */ > + __get_seccomp_filter(f); > + unlock_task_sighand(task, &flags); > + > + proc_pid_seccomp_cache_arch(m, SECCOMP_ARCH_NATIVE_NAME, > + f->cache.allow_native, error: ‘struct action_cache’ has no member named ‘allow_native’ struct action_cache is empty if SECCOMP_ARCH_NATIVE is not defined (so there are checks for it). > + SECCOMP_ARCH_NATIVE_NR); > + > +#ifdef SECCOMP_ARCH_COMPAT > + proc_pid_seccomp_cache_arch(m, SECCOMP_ARCH_COMPAT_NAME, > + f->cache.allow_compat, > + SECCOMP_ARCH_COMPAT_NR); > +#endif /* SECCOMP_ARCH_COMPAT */ > + > + __put_seccomp_filter(f); > + return 0; > +} > +#endif /* CONFIG_SECCOMP_CACHE_DEBUG */ > -- > 2.28.0 > -- Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v5 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-12-17 12:14 ` Geert Uytterhoeven @ 2020-12-17 18:34 ` YiFei Zhu 2020-12-18 12:35 ` Geert Uytterhoeven 0 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-12-17 18:34 UTC (permalink / raw) To: Geert Uytterhoeven Cc: Linux Containers, YiFei Zhu, bpf, Linux Kernel Mailing List, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Thu, Dec 17, 2020 at 6:14 AM Geert Uytterhoeven <geert@linux-m68k.org> wrote: > Should there be a dependency on SECCOMP_ARCH_NATIVE? > Should all architectures that implement seccomp have this? > > E.g. mips does select HAVE_ARCH_SECCOMP_FILTER, but doesn't > have SECCOMP_ARCH_NATIVE? > > (noticed with preliminary out-of-tree seccomp implementation for m68k, > which doesn't have SECCOMP_ARCH_NATIVE Hi Geert You are correct. This specific patch in this series was not applied, and this was addressed in a follow up patch series [1]. MIPS does not define SECCOMP_ARCH_NATIVE because the bitmap expects syscall numbers to start from 0, whereas MIPS does not (defines CONFIG_HAVE_SPARSE_SYSCALL_NR). The follow up patch makes it so that any arch with HAVE_SPARSE_SYSCALL_NR (currently just MIPS) cannot have CONFIG_SECCOMP_CACHE_DEBUG on, by the depend on clause. I see that you are doing an out of tree seccomp implementation for m68k. 
Assuming unchanged arch/xtensa/include/asm/syscall.h, something like this to arch/m68k/include/asm/seccomp.h should make it work: #define SECCOMP_ARCH_NATIVE AUDIT_ARCH_M68K #define SECCOMP_ARCH_NATIVE_NR NR_syscalls #define SECCOMP_ARCH_NATIVE_NAME "m68k" If the file does not exist already, arch/xtensa/include/asm/seccomp.h is a good example of how the file should look like, and remember to remove `generic-y += seccomp.h` from arch/m68k/include/asm/Kbuild. [1] https://lore.kernel.org/lkml/cover.1605101222.git.yifeifz2@illinois.edu/T/ YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
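Gathering YiFei's advice into the shape of the header he describes, a hypothetical arch/m68k/include/asm/seccomp.h might look like the fragment below. This file was never part of the series; it is an untested sketch assembled verbatim from the macro names in the message, with arch/xtensa/include/asm/seccomp.h as the in-tree model:

```c
/* arch/m68k/include/asm/seccomp.h -- hypothetical sketch, per the
 * advice above (cf. arch/xtensa/include/asm/seccomp.h). */
#ifndef _ASM_M68K_SECCOMP_H
#define _ASM_M68K_SECCOMP_H

#define SECCOMP_ARCH_NATIVE		AUDIT_ARCH_M68K
#define SECCOMP_ARCH_NATIVE_NR		NR_syscalls
#define SECCOMP_ARCH_NATIVE_NAME	"m68k"

#include <asm-generic/seccomp.h>

#endif /* _ASM_M68K_SECCOMP_H */
```

As the message notes, `generic-y += seccomp.h` would also have to be dropped from arch/m68k/include/asm/Kbuild so this file takes effect.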
* Re: [PATCH v5 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-12-17 18:34 ` YiFei Zhu @ 2020-12-18 12:35 ` Geert Uytterhoeven 0 siblings, 0 replies; 149+ messages in thread From: Geert Uytterhoeven @ 2020-12-18 12:35 UTC (permalink / raw) To: YiFei Zhu Cc: Linux Containers, YiFei Zhu, bpf, Linux Kernel Mailing List, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry Hi YiFei, On Thu, Dec 17, 2020 at 7:34 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > On Thu, Dec 17, 2020 at 6:14 AM Geert Uytterhoeven <geert@linux-m68k.org> wrote: > > Should there be a dependency on SECCOMP_ARCH_NATIVE? > > Should all architectures that implement seccomp have this? > > > > E.g. mips does select HAVE_ARCH_SECCOMP_FILTER, but doesn't > > have SECCOMP_ARCH_NATIVE? > > > > (noticed with preliminary out-of-tree seccomp implementation for m68k, > > which doesn't have SECCOMP_ARCH_NATIVE > > You are correct. This specific patch in this series was not applied, > and this was addressed in a follow up patch series [1]. MIPS does not > define SECCOMP_ARCH_NATIVE because the bitmap expects syscall numbers > to start from 0, whereas MIPS does not (defines > CONFIG_HAVE_SPARSE_SYSCALL_NR). The follow up patch makes it so that > any arch with HAVE_SPARSE_SYSCALL_NR (currently just MIPS) cannot have > CONFIG_SECCOMP_CACHE_DEBUG on, by the depend on clause. > > I see that you are doing an out of tree seccomp implementation for > m68k. 
Assuming unchanged arch/xtensa/include/asm/syscall.h, something > like this to arch/m68k/include/asm/seccomp.h should make it work: > > #define SECCOMP_ARCH_NATIVE AUDIT_ARCH_M68K > #define SECCOMP_ARCH_NATIVE_NR NR_syscalls > #define SECCOMP_ARCH_NATIVE_NAME "m68k" > > If the file does not exist already, arch/xtensa/include/asm/seccomp.h > is a good example of how the file should look like, and remember to > remove `generic-y += seccomp.h` from arch/m68k/include/asm/Kbuild. > > [1] https://lore.kernel.org/lkml/cover.1605101222.git.yifeifz2@illinois.edu/T/ Thank you for your extensive explanation. Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v5 seccomp 0/5]seccomp: Add bitmap cache of constant allow filter results 2020-10-11 15:47 ` [PATCH v5 seccomp 0/5]seccomp: Add bitmap cache of constant allow filter results YiFei Zhu ` (4 preceding siblings ...) 2020-10-11 15:47 ` [PATCH v5 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu @ 2020-10-27 19:14 ` Kees Cook 5 siblings, 0 replies; 149+ messages in thread From: Kees Cook @ 2020-10-27 19:14 UTC (permalink / raw) To: YiFei Zhu, containers Cc: Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Jack Chen, YiFei Zhu, Valentin Rothberg, Andrea Arcangeli, Dimitrios Skarlatos, Andy Lutomirski, David Laight, bpf, Jann Horn, Giuseppe Scrivano, Josep Torrellas, Hubertus Franke, Will Drewry, linux-kernel, Tycho Andersen, Aleksa Sarai On Sun, 11 Oct 2020 10:47:41 -0500, YiFei Zhu wrote: > Alternative: https://lore.kernel.org/lkml/20200923232923.3142503-1-keescook@chromium.org/T/ > > Major differences from the linked alternative by Kees: > * No x32 special-case handling -- not worth the complexity > * No caching of denylist -- not worth the complexity > * No seccomp arch pinning -- I think this is an independent feature > * The bitmaps are part of the filters rather than the task. > > [...] Applied to for-next/seccomp, thanks! I left off patch 5 for now until we sort out the rest of the SECCOMP_FILTER architectures, and tweaked patch 3 to include the architecture names. [1/4] seccomp/cache: Lookup syscall allowlist bitmap for fast path https://git.kernel.org/kees/c/f94defb8bf46 [2/4] seccomp/cache: Add "emulator" to check if filter is constant allow https://git.kernel.org/kees/c/e7dc9f1e5f6b [3/4] x86: Enable seccomp architecture tracking https://git.kernel.org/kees/c/1f68a4d393fe [4/4] selftests/seccomp: Compare bitmap vs filter overhead https://git.kernel.org/kees/c/57a339117e52 -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread