bpf.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: YiFei Zhu <zhuyifei1999@gmail.com>
To: containers@lists.linux-foundation.org
Cc: YiFei Zhu <yifeifz2@illinois.edu>,
	bpf@vger.kernel.org, linux-kernel@vger.kernel.org,
	Aleksa Sarai <cyphar@cyphar.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Andy Lutomirski <luto@amacapital.net>,
	David Laight <David.Laight@aculab.com>,
	Dimitrios Skarlatos <dskarlat@cs.cmu.edu>,
	Giuseppe Scrivano <gscrivan@redhat.com>,
	Hubertus Franke <frankeh@us.ibm.com>,
	Jack Chen <jianyan2@illinois.edu>, Jann Horn <jannh@google.com>,
	Josep Torrellas <torrella@illinois.edu>,
	Kees Cook <keescook@chromium.org>, Tianyin Xu <tyxu@illinois.edu>,
	Tobin Feldman-Fitzthum <tobin@ibm.com>,
	Tycho Andersen <tycho@tycho.pizza>,
	Valentin Rothberg <vrothber@redhat.com>,
	Will Drewry <wad@chromium.org>
Subject: [PATCH v3 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results
Date: Wed, 30 Sep 2020 10:19:11 -0500	[thread overview]
Message-ID: <cover.1601478774.git.yifeifz2@illinois.edu> (raw)
In-Reply-To: <cover.1600661418.git.yifeifz2@illinois.edu>

From: YiFei Zhu <yifeifz2@illinois.edu>

Alternative: https://lore.kernel.org/lkml/20200923232923.3142503-1-keescook@chromium.org/T/

Major differences from the linked alternative by Kees:
* No x32 special-case handling -- not worth the complexity
* No caching of denylist -- not worth the complexity
* No seccomp arch pinning -- I think this is an independent feature
* The bitmaps are part of the filters rather than the task.
* Architectures supported by default through arch number array,
  except for MIPS with its sparse syscall numbers.
* Configurable per-build for future different cache modes.

This series adds a bitmap to cache seccomp filter results if the
result permits a syscall and is indepenent of syscall arguments.
This visibly decreases seccomp overhead for most common seccomp
filters with very little memory footprint.

The overhead of running Seccomp filters has been part of some past
discussions [1][2][3]. Oftentimes, the filters have a large number
of instructions that check syscall numbers one by one and jump based
on that. Some users chain BPF filters which further enlarge the
overhead. A recent work [6] comprehensively measures the Seccomp
overhead and shows that the overhead is non-negligible and has a
non-trivial impact on application performance.

We observed some common filters, such as docker's [4] or
systemd's [5], will make most decisions based only on the syscall
numbers, and as past discussions considered, a bitmap where each bit
represents a syscall makes most sense for these filters.

In order to build this bitmap at filter attach time, each filter is
emulated for every syscall (under each possible architecture), and
checked for any accesses of struct seccomp_data that are not the "arch"
nor "nr" (syscall) members. If only "arch" and "nr" are examined, and
the program returns allow, then we can be sure that the filter must
return allow independent from syscall arguments.

When it is concluded that an allow must occur for the given
architecture and syscall pair, seccomp will immediately allow
the syscall, bypassing further BPF execution.

Ongoing work is to further support arguments with fast hash table
lookups. We are investigating the performance of doing so [6], and how
to best integrate with the existing seccomp infrastructure.

Some benchmarks are performed with results in patch 5, copied below:
  Current BPF sysctl settings:
  net.core.bpf_jit_enable = 1
  net.core.bpf_jit_harden = 0
  Benchmarking 200000000 syscalls...
  129.359381409 - 0.008724424 = 129350656985 (129.4s)
  getpid native: 646 ns
  264.385890006 - 129.360453229 = 135025436777 (135.0s)
  getpid RET_ALLOW 1 filter (bitmap): 675 ns
  399.400511893 - 264.387045901 = 135013465992 (135.0s)
  getpid RET_ALLOW 2 filters (bitmap): 675 ns
  545.872866260 - 399.401718327 = 146471147933 (146.5s)
  getpid RET_ALLOW 3 filters (full): 732 ns
  696.337101319 - 545.874097681 = 150463003638 (150.5s)
  getpid RET_ALLOW 4 filters (full): 752 ns
  Estimated total seccomp overhead for 1 bitmapped filter: 29 ns
  Estimated total seccomp overhead for 2 bitmapped filters: 29 ns
  Estimated total seccomp overhead for 3 full filters: 86 ns
  Estimated total seccomp overhead for 4 full filters: 106 ns
  Estimated seccomp entry overhead: 29 ns
  Estimated seccomp per-filter overhead (last 2 diff): 20 ns
  Estimated seccomp per-filter overhead (filters / 4): 19 ns
  Expectations:
  	native ≤ 1 bitmap (646 ≤ 675): ✔️
  	native ≤ 1 filter (646 ≤ 732): ✔️
  	per-filter (last 2 diff) ≈ per-filter (filters / 4) (20 ≈ 19): ✔️
  	1 bitmapped ≈ 2 bitmapped (29 ≈ 29): ✔️
  	entry ≈ 1 bitmapped (29 ≈ 29): ✔️
  	entry ≈ 2 bitmapped (29 ≈ 29): ✔️
  	native + entry + (per filter * 4) ≈ 4 filters total (755 ≈ 752): ✔️

v2 -> v3:
* Added array_index_nospec guards
* No more syscall_arches[] array and expecting on loop unrolling. Arches
  are configured with per-arch seccomp.h.
* Moved filter emulation to attach time (from prepare time).
* Further simplified emulator, basing on Kees's code.
* Guard /proc/pid/seccomp_cache with CAP_SYS_ADMIN.

v1 -> v2:
* Corrected one outdated function documentation.

RFC -> v1:
* Config made on by default across all arches that could support it.
* Added arch numbers array and emulate filter for each arch number, and
  have a per-arch bitmap.
* Massively simplified the emulator so it would only support the common
  instructions in Kees's list.
* Fixed inheriting bitmap across filters (filter->prev is always NULL
  during prepare).
* Stole the selftest from Kees.
* Added a /proc/pid/seccomp_cache by Jann's suggestion.

Patch 1 adds the arch macros for x86.

Patch 2 implements the emulator that finds if a filter must return allow,

Patch 3 implements the test_bit against the bitmaps.

Patch 4 updates the selftest to better show the new semantics.

Patch 5 implements /proc/pid/seccomp_cache.

[1] https://lore.kernel.org/linux-security-module/c22a6c3cefc2412cad00ae14c1371711@huawei.com/T/
[2] https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/T/
[3] https://github.com/seccomp/libseccomp/issues/116
[4] https://github.com/moby/moby/blob/ae0ef82b90356ac613f329a8ef5ee42ca923417d/profiles/seccomp/default.json
[5] https://github.com/systemd/systemd/blob/6743a1caf4037f03dc51a1277855018e4ab61957/src/shared/seccomp-util.c#L270
[6] Draco: Architectural and Operating System Support for System Call Security
    https://tianyin.github.io/pub/draco.pdf, MICRO-53, Oct. 2020

Kees Cook (2):
  x86: Enable seccomp architecture tracking
  selftests/seccomp: Compare bitmap vs filter overhead

YiFei Zhu (3):
  seccomp/cache: Add "emulator" to check if filter is constant allow
  seccomp/cache: Lookup syscall allowlist for fast path
  seccomp/cache: Report cache data through /proc/pid/seccomp_cache

 arch/Kconfig                                  |  49 ++++
 arch/x86/Kconfig                              |   1 +
 arch/x86/include/asm/seccomp.h                |  15 +
 fs/proc/base.c                                |   3 +
 include/linux/seccomp.h                       |   5 +
 kernel/seccomp.c                              | 265 +++++++++++++++++-
 .../selftests/seccomp/seccomp_benchmark.c     | 151 ++++++++--
 tools/testing/selftests/seccomp/settings      |   2 +-
 8 files changed, 467 insertions(+), 24 deletions(-)

--
2.28.0

  parent reply	other threads:[~2020-09-30 15:20 UTC|newest]

Thread overview: 149+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-09-21  5:35 [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls YiFei Zhu
2020-09-21  5:35 ` [RFC PATCH seccomp 1/2] seccomp/cache: Add "emulator" to check if filter is arg-dependent YiFei Zhu
2020-09-21 17:47   ` Jann Horn
2020-09-21 18:38     ` Jann Horn
2020-09-21 23:44     ` YiFei Zhu
2020-09-22  0:25       ` Jann Horn
2020-09-22  0:47         ` YiFei Zhu
2020-09-21  5:35 ` [RFC PATCH seccomp 2/2] seccomp/cache: Cache filter results that allow syscalls YiFei Zhu
2020-09-21 18:08   ` Jann Horn
2020-09-21 22:50     ` YiFei Zhu
2020-09-21 22:57       ` Jann Horn
2020-09-21 23:08         ` YiFei Zhu
2020-09-25  0:01   ` [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array Kees Cook
2020-09-25  0:15     ` Jann Horn
2020-09-25  0:18       ` Al Viro
2020-09-25  0:24         ` Jann Horn
2020-09-25  1:27     ` YiFei Zhu
2020-09-25  3:09       ` Kees Cook
2020-09-25  3:28         ` YiFei Zhu
2020-09-25 16:39           ` YiFei Zhu
2020-09-21  5:48 ` [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls Sargun Dhillon
2020-09-21  7:13   ` YiFei Zhu
2020-09-21  8:30 ` Christian Brauner
2020-09-21  8:44   ` YiFei Zhu
2020-09-21 13:51 ` Tycho Andersen
2020-09-21 15:27   ` YiFei Zhu
2020-09-21 16:39     ` Tycho Andersen
2020-09-21 22:57       ` YiFei Zhu
2020-09-21 19:16 ` Jann Horn
     [not found]   ` <OF8837FC1A.5C0D4D64-ON852585EA.006B677F-852585EA.006BA663@notes.na.collabserv.com>
2020-09-21 19:45     ` Jann Horn
2020-09-23 19:26 ` Kees Cook
2020-09-23 22:54   ` YiFei Zhu
2020-09-24  6:52     ` Kees Cook
2020-09-24 12:06 ` [PATCH seccomp 0/6] " YiFei Zhu
2020-09-24 12:06   ` [PATCH seccomp 1/6] seccomp: Move config option SECCOMP to arch/Kconfig YiFei Zhu
2020-09-24 12:06     ` YiFei Zhu
2020-09-24 12:06   ` [PATCH seccomp 2/6] asm/syscall.h: Add syscall_arches[] array YiFei Zhu
2020-09-24 12:06   ` [PATCH seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent YiFei Zhu
2020-09-24 12:06   ` [PATCH seccomp 4/6] seccomp/cache: Lookup syscall allowlist for fast path YiFei Zhu
2020-09-24 12:06   ` [PATCH seccomp 5/6] selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu
2020-09-24 12:06   ` [PATCH seccomp 6/6] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu
2020-09-24 12:44   ` [PATCH v2 seccomp 0/6] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls YiFei Zhu
2020-09-24 12:44     ` [PATCH v2 seccomp 1/6] seccomp: Move config option SECCOMP to arch/Kconfig YiFei Zhu
2020-09-24 19:11       ` Kees Cook
2020-10-27  9:52       ` Geert Uytterhoeven
2020-10-27 19:08         ` YiFei Zhu
2020-10-28  0:06         ` Kees Cook
2020-10-28  8:18           ` Geert Uytterhoeven
2020-10-28  9:34             ` Jann Horn
2020-09-24 12:44     ` [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array YiFei Zhu
2020-09-24 13:47       ` David Laight
2020-09-24 14:16         ` YiFei Zhu
2020-09-24 14:20           ` David Laight
2020-09-24 14:37             ` YiFei Zhu
2020-09-24 16:02               ` YiFei Zhu
2020-09-24 12:44     ` [PATCH v2 seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent YiFei Zhu
2020-09-24 23:25       ` Kees Cook
2020-09-25  3:04         ` YiFei Zhu
2020-09-25 16:45           ` YiFei Zhu
2020-09-25 19:42             ` Kees Cook
2020-09-25 19:51               ` Andy Lutomirski
2020-09-25 20:37                 ` Kees Cook
2020-09-25 21:07                   ` Andy Lutomirski
2020-09-25 23:49                     ` Kees Cook
2020-09-26  0:34                       ` Andy Lutomirski
2020-09-26  1:23                     ` YiFei Zhu
2020-09-26  2:47                       ` Andy Lutomirski
2020-09-26  4:35                         ` Kees Cook
2020-09-24 12:44     ` [PATCH v2 seccomp 4/6] seccomp/cache: Lookup syscall allowlist for fast path YiFei Zhu
2020-09-24 23:46       ` Kees Cook
2020-09-25  1:55         ` YiFei Zhu
2020-09-24 12:44     ` [PATCH v2 seccomp 5/6] selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu
2020-09-24 23:47       ` Kees Cook
2020-09-25  1:35         ` YiFei Zhu
2020-09-24 12:44     ` [PATCH v2 seccomp 6/6] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu
2020-09-24 23:56       ` Kees Cook
2020-09-25  3:11         ` YiFei Zhu
2020-09-25  3:26           ` Kees Cook
2020-09-30 15:19 ` YiFei Zhu [this message]
2020-09-30 15:19   ` [PATCH v3 seccomp 1/5] x86: Enable seccomp architecture tracking YiFei Zhu
2020-09-30 21:21     ` Kees Cook
2020-09-30 21:33       ` Jann Horn
2020-09-30 22:53         ` Kees Cook
2020-09-30 23:15           ` Jann Horn
2020-09-30 15:19   ` [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow YiFei Zhu
2020-09-30 22:24     ` Jann Horn
2020-09-30 22:49       ` Kees Cook
2020-10-01 11:28       ` YiFei Zhu
2020-10-01 21:08         ` Jann Horn
2020-09-30 22:40     ` Kees Cook
2020-10-01 11:52       ` YiFei Zhu
2020-10-01 21:05         ` Kees Cook
2020-10-02 11:08           ` YiFei Zhu
2020-10-09  4:47     ` YiFei Zhu
2020-10-09  5:41       ` Kees Cook
2020-09-30 15:19   ` [PATCH v3 seccomp 3/5] seccomp/cache: Lookup syscall allowlist for fast path YiFei Zhu
2020-09-30 21:32     ` Kees Cook
2020-10-09  0:17       ` YiFei Zhu
2020-10-09  5:35         ` Kees Cook
2020-09-30 15:19   ` [PATCH v3 seccomp 4/5] selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu
2020-09-30 15:19   ` [PATCH v3 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu
2020-09-30 22:00     ` Jann Horn
2020-09-30 23:12       ` Kees Cook
2020-10-01 12:06       ` YiFei Zhu
2020-10-01 16:05         ` Jann Horn
2020-10-01 16:18           ` YiFei Zhu
2020-09-30 22:59     ` Kees Cook
2020-09-30 23:08       ` Jann Horn
2020-09-30 23:21         ` Kees Cook
2020-10-09 17:14   ` [PATCH v4 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu
2020-10-09 17:14     ` [PATCH v4 seccomp 1/5] seccomp/cache: Lookup syscall allowlist bitmap for fast path YiFei Zhu
2020-10-09 21:30       ` Jann Horn
2020-10-09 23:18       ` Kees Cook
2020-10-09 17:14     ` [PATCH v4 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow YiFei Zhu
2020-10-09 21:30       ` Jann Horn
2020-10-09 22:47         ` Kees Cook
2020-10-09 17:14     ` [PATCH v4 seccomp 3/5] x86: Enable seccomp architecture tracking YiFei Zhu
2020-10-09 17:25       ` Andy Lutomirski
2020-10-09 18:32         ` YiFei Zhu
2020-10-09 20:59           ` Andy Lutomirski
2020-10-09 17:14     ` [PATCH v4 seccomp 4/5] selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu
2020-10-09 17:14     ` [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu
2020-10-09 21:45       ` Jann Horn
2020-10-09 23:14       ` Kees Cook
2020-10-10 13:26         ` YiFei Zhu
2020-10-12 22:57           ` Kees Cook
2020-10-13  0:31             ` YiFei Zhu
2020-10-22 20:52               ` YiFei Zhu
2020-10-22 22:32                 ` Kees Cook
2020-10-22 23:40                   ` YiFei Zhu
2020-10-24  2:51                     ` Kees Cook
2020-10-30 12:18                       ` YiFei Zhu
2020-11-03 13:00                         ` YiFei Zhu
2020-11-04  0:29                           ` Kees Cook
2020-11-04 11:40                             ` YiFei Zhu
2020-11-04 18:57                               ` Kees Cook
2020-10-11 15:47     ` [PATCH v5 seccomp 0/5]seccomp: Add bitmap cache of constant allow filter results YiFei Zhu
2020-10-11 15:47       ` [PATCH v5 seccomp 1/5] seccomp/cache: Lookup syscall allowlist bitmap for fast path YiFei Zhu
2020-10-12  6:42         ` Jann Horn
2020-10-11 15:47       ` [PATCH v5 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow YiFei Zhu
2020-10-12  6:46         ` Jann Horn
2020-10-11 15:47       ` [PATCH v5 seccomp 3/5] x86: Enable seccomp architecture tracking YiFei Zhu
2020-10-11 15:47       ` [PATCH v5 seccomp 4/5] selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu
2020-10-11 15:47       ` [PATCH v5 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu
2020-10-12  6:49         ` Jann Horn
2020-12-17 12:14         ` Geert Uytterhoeven
2020-12-17 18:34           ` YiFei Zhu
2020-12-18 12:35             ` Geert Uytterhoeven
2020-10-27 19:14       ` [PATCH v5 seccomp 0/5]seccomp: Add bitmap cache of constant allow filter results Kees Cook

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=cover.1601478774.git.yifeifz2@illinois.edu \
    --to=zhuyifei1999@gmail.com \
    --cc=David.Laight@aculab.com \
    --cc=aarcange@redhat.com \
    --cc=bpf@vger.kernel.org \
    --cc=containers@lists.linux-foundation.org \
    --cc=cyphar@cyphar.com \
    --cc=dskarlat@cs.cmu.edu \
    --cc=frankeh@us.ibm.com \
    --cc=gscrivan@redhat.com \
    --cc=jannh@google.com \
    --cc=jianyan2@illinois.edu \
    --cc=keescook@chromium.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=luto@amacapital.net \
    --cc=tobin@ibm.com \
    --cc=torrella@illinois.edu \
    --cc=tycho@tycho.pizza \
    --cc=tyxu@illinois.edu \
    --cc=vrothber@redhat.com \
    --cc=wad@chromium.org \
    --cc=yifeifz2@illinois.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).