From: YiFei Zhu <zhuyifei1999@gmail.com> To: containers@lists.linux-foundation.org Cc: YiFei Zhu <yifeifz2@illinois.edu>, bpf@vger.kernel.org, linux-kernel@vger.kernel.org, Aleksa Sarai <cyphar@cyphar.com>, Andrea Arcangeli <aarcange@redhat.com>, Andy Lutomirski <luto@amacapital.net>, David Laight <David.Laight@aculab.com>, Dimitrios Skarlatos <dskarlat@cs.cmu.edu>, Giuseppe Scrivano <gscrivan@redhat.com>, Hubertus Franke <frankeh@us.ibm.com>, Jack Chen <jianyan2@illinois.edu>, Jann Horn <jannh@google.com>, Josep Torrellas <torrella@illinois.edu>, Kees Cook <keescook@chromium.org>, Tianyin Xu <tyxu@illinois.edu>, Tobin Feldman-Fitzthum <tobin@ibm.com>, Tycho Andersen <tycho@tycho.pizza>, Valentin Rothberg <vrothber@redhat.com>, Will Drewry <wad@chromium.org> Subject: [PATCH v5 seccomp 0/5]seccomp: Add bitmap cache of constant allow filter results Date: Sun, 11 Oct 2020 10:47:41 -0500 Message-ID: <cover.1602431034.git.yifeifz2@illinois.edu> (raw) In-Reply-To: <cover.1602263422.git.yifeifz2@illinois.edu> From: YiFei Zhu <yifeifz2@illinois.edu> Alternative: https://lore.kernel.org/lkml/20200923232923.3142503-1-keescook@chromium.org/T/ Major differences from the linked alternative by Kees: * No x32 special-case handling -- not worth the complexity * No caching of denylist -- not worth the complexity * No seccomp arch pinning -- I think this is an independent feature * The bitmaps are part of the filters rather than the task. This series adds a bitmap to cache seccomp filter results if the result permits a syscall and is indepenent of syscall arguments. This visibly decreases seccomp overhead for most common seccomp filters with very little memory footprint. The overhead of running Seccomp filters has been part of some past discussions [1][2][3]. Oftentimes, the filters have a large number of instructions that check syscall numbers one by one and jump based on that. Some users chain BPF filters which further enlarge the overhead. A recent work [6] comprehensively measures the Seccomp overhead and shows that the overhead is non-negligible and has a non-trivial impact on application performance. We observed some common filters, such as docker's [4] or systemd's [5], will make most decisions based only on the syscall numbers, and as past discussions considered, a bitmap where each bit represents a syscall makes most sense for these filters. In order to build this bitmap at filter attach time, each filter is emulated for every syscall (under each possible architecture), and checked for any accesses of struct seccomp_data that are not the "arch" nor "nr" (syscall) members. If only "arch" and "nr" are examined, and the program returns allow, then we can be sure that the filter must return allow independent from syscall arguments. When it is concluded that an allow must occur for the given architecture and syscall pair, seccomp will immediately allow the syscall, bypassing further BPF execution. Ongoing work is to further support arguments with fast hash table lookups. We are investigating the performance of doing so [6], and how to best integrate with the existing seccomp infrastructure. Some benchmarks are performed with results in patch 5, copied below: Current BPF sysctl settings: net.core.bpf_jit_enable = 1 net.core.bpf_jit_harden = 0 Benchmarking 200000000 syscalls... 129.359381409 - 0.008724424 = 129350656985 (129.4s) getpid native: 646 ns 264.385890006 - 129.360453229 = 135025436777 (135.0s) getpid RET_ALLOW 1 filter (bitmap): 675 ns 399.400511893 - 264.387045901 = 135013465992 (135.0s) getpid RET_ALLOW 2 filters (bitmap): 675 ns 545.872866260 - 399.401718327 = 146471147933 (146.5s) getpid RET_ALLOW 3 filters (full): 732 ns 696.337101319 - 545.874097681 = 150463003638 (150.5s) getpid RET_ALLOW 4 filters (full): 752 ns Estimated total seccomp overhead for 1 bitmapped filter: 29 ns Estimated total seccomp overhead for 2 bitmapped filters: 29 ns Estimated total seccomp overhead for 3 full filters: 86 ns Estimated total seccomp overhead for 4 full filters: 106 ns Estimated seccomp entry overhead: 29 ns Estimated seccomp per-filter overhead (last 2 diff): 20 ns Estimated seccomp per-filter overhead (filters / 4): 19 ns Expectations: native ≤ 1 bitmap (646 ≤ 675): ✔️ native ≤ 1 filter (646 ≤ 732): ✔️ per-filter (last 2 diff) ≈ per-filter (filters / 4) (20 ≈ 19): ✔️ 1 bitmapped ≈ 2 bitmapped (29 ≈ 29): ✔️ entry ≈ 1 bitmapped (29 ≈ 29): ✔️ entry ≈ 2 bitmapped (29 ≈ 29): ✔️ native + entry + (per filter * 4) ≈ 4 filters total (755 ≈ 752): ✔️ v4 -> v5: * Typo and wording fixes * Skip arch number test when there are only one arch * Fixed prog instruction number check. * Added comment about the behavior of x32. * /proc/pid/seccomp_cache return -ESRCH for exiting process. * Fixed /proc/pid/seccomp_cache depend on the architecture. * Fixed struct seq_file visibility reported by kernel test robot. v3 -> v4: * Reordered patches * Naming changes * Fixed racing in /proc/pid/seccomp_cache against filter being released from task, using Jann's suggestion of sighand spinlock. * Cache no longer configurable. * Copied some description from cover letter to commit messages. * Used Kees's logic to set clear bits from bitmap, rather than set bits. v2 -> v3: * Added array_index_nospec guards * No more syscall_arches[] array and expecting on loop unrolling. Arches are configured with per-arch seccomp.h. * Moved filter emulation to attach time (from prepare time). * Further simplified emulator, basing on Kees's code. * Guard /proc/pid/seccomp_cache with CAP_SYS_ADMIN. v1 -> v2: * Corrected one outdated function documentation. RFC -> v1: * Config made on by default across all arches that could support it. * Added arch numbers array and emulate filter for each arch number, and have a per-arch bitmap. * Massively simplified the emulator so it would only support the common instructions in Kees's list. * Fixed inheriting bitmap across filters (filter->prev is always NULL during prepare). * Stole the selftest from Kees. * Added a /proc/pid/seccomp_cache by Jann's suggestion. Patch 1 implements the test_bit against the bitmaps. Patch 2 implements the emulator that finds if a filter must return allow, Patch 3 adds the arch macros for x86. Patch 4 updates the selftest to better show the new semantics. Patch 5 implements /proc/pid/seccomp_cache. [1] https://lore.kernel.org/linux-security-module/c22a6c3cefc2412cad00ae14c1371711@huawei.com/T/ [2] https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/T/ [3] https://github.com/seccomp/libseccomp/issues/116 [4] https://github.com/moby/moby/blob/ae0ef82b90356ac613f329a8ef5ee42ca923417d/profiles/seccomp/default.json [5] https://github.com/systemd/systemd/blob/6743a1caf4037f03dc51a1277855018e4ab61957/src/shared/seccomp-util.c#L270 [6] Draco: Architectural and Operating System Support for System Call Security https://tianyin.github.io/pub/draco.pdf, MICRO-53, Oct. 2020 Kees Cook (2): x86: Enable seccomp architecture tracking selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu (3): seccomp/cache: Lookup syscall allowlist bitmap for fast path seccomp/cache: Add "emulator" to check if filter is constant allow seccomp/cache: Report cache data through /proc/pid/seccomp_cache arch/Kconfig | 24 ++ arch/x86/Kconfig | 1 + arch/x86/include/asm/seccomp.h | 20 ++ fs/proc/base.c | 6 + include/linux/seccomp.h | 7 + kernel/seccomp.c | 292 +++++++++++++++++- .../selftests/seccomp/seccomp_benchmark.c | 151 +++++++-- tools/testing/selftests/seccomp/settings | 2 +- 8 files changed, 479 insertions(+), 24 deletions(-) -- 2.28.0
next prev parent reply index Thread overview: 149+ messages / expand[flat|nested] mbox.gz Atom feed top 2020-09-21 5:35 [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls YiFei Zhu 2020-09-21 5:35 ` [RFC PATCH seccomp 1/2] seccomp/cache: Add "emulator" to check if filter is arg-dependent YiFei Zhu 2020-09-21 17:47 ` Jann Horn 2020-09-21 18:38 ` Jann Horn 2020-09-21 23:44 ` YiFei Zhu 2020-09-22 0:25 ` Jann Horn 2020-09-22 0:47 ` YiFei Zhu 2020-09-21 5:35 ` [RFC PATCH seccomp 2/2] seccomp/cache: Cache filter results that allow syscalls YiFei Zhu 2020-09-21 18:08 ` Jann Horn 2020-09-21 22:50 ` YiFei Zhu 2020-09-21 22:57 ` Jann Horn 2020-09-21 23:08 ` YiFei Zhu 2020-09-25 0:01 ` [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array Kees Cook 2020-09-25 0:15 ` Jann Horn 2020-09-25 0:18 ` Al Viro 2020-09-25 0:24 ` Jann Horn 2020-09-25 1:27 ` YiFei Zhu 2020-09-25 3:09 ` Kees Cook 2020-09-25 3:28 ` YiFei Zhu 2020-09-25 16:39 ` YiFei Zhu 2020-09-21 5:48 ` [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls Sargun Dhillon 2020-09-21 7:13 ` YiFei Zhu 2020-09-21 8:30 ` Christian Brauner 2020-09-21 8:44 ` YiFei Zhu 2020-09-21 13:51 ` Tycho Andersen 2020-09-21 15:27 ` YiFei Zhu 2020-09-21 16:39 ` Tycho Andersen 2020-09-21 22:57 ` YiFei Zhu 2020-09-21 19:16 ` Jann Horn [not found] ` <OF8837FC1A.5C0D4D64-ON852585EA.006B677F-852585EA.006BA663@notes.na.collabserv.com> 2020-09-21 19:45 ` Jann Horn 2020-09-23 19:26 ` Kees Cook 2020-09-23 22:54 ` YiFei Zhu 2020-09-24 6:52 ` Kees Cook 2020-09-24 12:06 ` [PATCH seccomp 0/6] " YiFei Zhu 2020-09-24 12:06 ` [PATCH seccomp 1/6] seccomp: Move config option SECCOMP to arch/Kconfig YiFei Zhu 2020-09-24 12:06 ` YiFei Zhu 2020-09-24 12:06 ` [PATCH seccomp 2/6] asm/syscall.h: Add syscall_arches[] array YiFei Zhu 2020-09-24 12:06 ` [PATCH seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent YiFei Zhu 2020-09-24 12:06 ` [PATCH seccomp 4/6] seccomp/cache: Lookup syscall allowlist for fast path YiFei Zhu 2020-09-24 12:06 ` [PATCH seccomp 5/6] selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu 2020-09-24 12:06 ` [PATCH seccomp 6/6] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu 2020-09-24 12:44 ` [PATCH v2 seccomp 0/6] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls YiFei Zhu 2020-09-24 12:44 ` [PATCH v2 seccomp 1/6] seccomp: Move config option SECCOMP to arch/Kconfig YiFei Zhu 2020-09-24 19:11 ` Kees Cook 2020-10-27 9:52 ` Geert Uytterhoeven 2020-10-27 19:08 ` YiFei Zhu 2020-10-28 0:06 ` Kees Cook 2020-10-28 8:18 ` Geert Uytterhoeven 2020-10-28 9:34 ` Jann Horn 2020-09-24 12:44 ` [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array YiFei Zhu 2020-09-24 13:47 ` David Laight 2020-09-24 14:16 ` YiFei Zhu 2020-09-24 14:20 ` David Laight 2020-09-24 14:37 ` YiFei Zhu 2020-09-24 16:02 ` YiFei Zhu 2020-09-24 12:44 ` [PATCH v2 seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent YiFei Zhu 2020-09-24 23:25 ` Kees Cook 2020-09-25 3:04 ` YiFei Zhu 2020-09-25 16:45 ` YiFei Zhu 2020-09-25 19:42 ` Kees Cook 2020-09-25 19:51 ` Andy Lutomirski 2020-09-25 20:37 ` Kees Cook 2020-09-25 21:07 ` Andy Lutomirski 2020-09-25 23:49 ` Kees Cook 2020-09-26 0:34 ` Andy Lutomirski 2020-09-26 1:23 ` YiFei Zhu 2020-09-26 2:47 ` Andy Lutomirski 2020-09-26 4:35 ` Kees Cook 2020-09-24 12:44 ` [PATCH v2 seccomp 4/6] seccomp/cache: Lookup syscall allowlist for fast path YiFei Zhu 2020-09-24 23:46 ` Kees Cook 2020-09-25 1:55 ` YiFei Zhu 2020-09-24 12:44 ` [PATCH v2 seccomp 5/6] selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu 2020-09-24 23:47 ` Kees Cook 2020-09-25 1:35 ` YiFei Zhu 2020-09-24 12:44 ` [PATCH v2 seccomp 6/6] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu 2020-09-24 23:56 ` Kees Cook 2020-09-25 3:11 ` YiFei Zhu 2020-09-25 3:26 ` Kees Cook 2020-09-30 15:19 ` [PATCH v3 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu 2020-09-30 15:19 ` [PATCH v3 seccomp 1/5] x86: Enable seccomp architecture tracking YiFei Zhu 2020-09-30 21:21 ` Kees Cook 2020-09-30 21:33 ` Jann Horn 2020-09-30 22:53 ` Kees Cook 2020-09-30 23:15 ` Jann Horn 2020-09-30 15:19 ` [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow YiFei Zhu 2020-09-30 22:24 ` Jann Horn 2020-09-30 22:49 ` Kees Cook 2020-10-01 11:28 ` YiFei Zhu 2020-10-01 21:08 ` Jann Horn 2020-09-30 22:40 ` Kees Cook 2020-10-01 11:52 ` YiFei Zhu 2020-10-01 21:05 ` Kees Cook 2020-10-02 11:08 ` YiFei Zhu 2020-10-09 4:47 ` YiFei Zhu 2020-10-09 5:41 ` Kees Cook 2020-09-30 15:19 ` [PATCH v3 seccomp 3/5] seccomp/cache: Lookup syscall allowlist for fast path YiFei Zhu 2020-09-30 21:32 ` Kees Cook 2020-10-09 0:17 ` YiFei Zhu 2020-10-09 5:35 ` Kees Cook 2020-09-30 15:19 ` [PATCH v3 seccomp 4/5] selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu 2020-09-30 15:19 ` [PATCH v3 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu 2020-09-30 22:00 ` Jann Horn 2020-09-30 23:12 ` Kees Cook 2020-10-01 12:06 ` YiFei Zhu 2020-10-01 16:05 ` Jann Horn 2020-10-01 16:18 ` YiFei Zhu 2020-09-30 22:59 ` Kees Cook 2020-09-30 23:08 ` Jann Horn 2020-09-30 23:21 ` Kees Cook 2020-10-09 17:14 ` [PATCH v4 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu 2020-10-09 17:14 ` [PATCH v4 seccomp 1/5] seccomp/cache: Lookup syscall allowlist bitmap for fast path YiFei Zhu 2020-10-09 21:30 ` Jann Horn 2020-10-09 23:18 ` Kees Cook 2020-10-09 17:14 ` [PATCH v4 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow YiFei Zhu 2020-10-09 21:30 ` Jann Horn 2020-10-09 22:47 ` Kees Cook 2020-10-09 17:14 ` [PATCH v4 seccomp 3/5] x86: Enable seccomp architecture tracking YiFei Zhu 2020-10-09 17:25 ` Andy Lutomirski 2020-10-09 18:32 ` YiFei Zhu 2020-10-09 20:59 ` Andy Lutomirski 2020-10-09 17:14 ` [PATCH v4 seccomp 4/5] selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu 2020-10-09 17:14 ` [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu 2020-10-09 21:45 ` Jann Horn 2020-10-09 23:14 ` Kees Cook 2020-10-10 13:26 ` YiFei Zhu 2020-10-12 22:57 ` Kees Cook 2020-10-13 0:31 ` YiFei Zhu 2020-10-22 20:52 ` YiFei Zhu 2020-10-22 22:32 ` Kees Cook 2020-10-22 23:40 ` YiFei Zhu 2020-10-24 2:51 ` Kees Cook 2020-10-30 12:18 ` YiFei Zhu 2020-11-03 13:00 ` YiFei Zhu 2020-11-04 0:29 ` Kees Cook 2020-11-04 11:40 ` YiFei Zhu 2020-11-04 18:57 ` Kees Cook 2020-10-11 15:47 ` YiFei Zhu [this message] 2020-10-11 15:47 ` [PATCH v5 seccomp 1/5] seccomp/cache: Lookup syscall allowlist bitmap for fast path YiFei Zhu 2020-10-12 6:42 ` Jann Horn 2020-10-11 15:47 ` [PATCH v5 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow YiFei Zhu 2020-10-12 6:46 ` Jann Horn 2020-10-11 15:47 ` [PATCH v5 seccomp 3/5] x86: Enable seccomp architecture tracking YiFei Zhu 2020-10-11 15:47 ` [PATCH v5 seccomp 4/5] selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu 2020-10-11 15:47 ` [PATCH v5 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu 2020-10-12 6:49 ` Jann Horn 2020-12-17 12:14 ` Geert Uytterhoeven 2020-12-17 18:34 ` YiFei Zhu 2020-12-18 12:35 ` Geert Uytterhoeven 2020-10-27 19:14 ` [PATCH v5 seccomp 0/5]seccomp: Add bitmap cache of constant allow filter results Kees Cook
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=cover.1602431034.git.yifeifz2@illinois.edu \ --to=zhuyifei1999@gmail.com \ --cc=David.Laight@aculab.com \ --cc=aarcange@redhat.com \ --cc=bpf@vger.kernel.org \ --cc=containers@lists.linux-foundation.org \ --cc=cyphar@cyphar.com \ --cc=dskarlat@cs.cmu.edu \ --cc=frankeh@us.ibm.com \ --cc=gscrivan@redhat.com \ --cc=jannh@google.com \ --cc=jianyan2@illinois.edu \ --cc=keescook@chromium.org \ --cc=linux-kernel@vger.kernel.org \ --cc=luto@amacapital.net \ --cc=tobin@ibm.com \ --cc=torrella@illinois.edu \ --cc=tycho@tycho.pizza \ --cc=tyxu@illinois.edu \ --cc=vrothber@redhat.com \ --cc=wad@chromium.org \ --cc=yifeifz2@illinois.edu \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: link
BPF Archive on lore.kernel.org Archives are clonable: git clone --mirror https://lore.kernel.org/bpf/0 bpf/git/0.git # If you have public-inbox 1.1+ installed, you may # initialize and index your mirror using the following commands: public-inbox-init -V2 bpf bpf/ https://lore.kernel.org/bpf \ bpf@vger.kernel.org public-inbox-index bpf Example config snippet for mirrors Newsgroup available over NNTP: nntp://nntp.lore.kernel.org/org.kernel.vger.bpf AGPL code for this site: git clone https://public-inbox.org/public-inbox.git