* [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls
@ 2020-09-21  5:35 YiFei Zhu
  2020-09-21  5:35 ` [RFC PATCH seccomp 1/2] seccomp/cache: Add "emulator" to check if filter is arg-dependent YiFei Zhu
  ` (8 more replies)
  0 siblings, 9 replies; 149+ messages in thread
From: YiFei Zhu @ 2020-09-21  5:35 UTC (permalink / raw)
To: containers
Cc: YiFei Zhu, bpf, Andrea Arcangeli, Dimitrios Skarlatos,
    Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas,
    Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Valentin Rothberg

From: YiFei Zhu <yifeifz2@illinois.edu>

This series adds a bitmap to cache seccomp filter results when the
result allows a syscall and is independent of the syscall arguments.
This visibly decreases seccomp overhead for the most common seccomp
filters, with very little memory footprint.

The overhead of running seccomp filters has been part of several past
discussions [1][2][3]. Oftentimes, the filters have a large number of
instructions that check syscall numbers one by one and jump based on
that. Some users chain BPF filters, which further enlarges the
overhead. A recent work [6] comprehensively measures the seccomp
overhead and shows that it is non-negligible and has a non-trivial
impact on application performance.

We propose SECCOMP_CACHE, a cache-based solution to minimize the
seccomp overhead. The basic idea is to cache the result of each
syscall check to save the overhead of executing the filters on
subsequent calls. This is feasible because the check in seccomp is
stateless: the result for the same syscall ID and arguments remains
the same. We observed that some common filters, such as docker's [4]
or systemd's [5], make most decisions based only on the syscall
numbers, and, as past discussions considered, a bitmap where each bit
represents a syscall makes the most sense for these filters.

In the past, Kees proposed [2] to have an "add this syscall to the
reject bitmask" interface.
It is indeed much easier to securely build a reject accelerator that
pre-filters syscalls before they are passed to the BPF filters, since
it can only strengthen the security the filter provides. However,
filter rejections are ultimately an exceptional / rare case. Here,
instead of accelerating what is rejected, we accelerate what is
allowed. In order not to compromise the security rules the BPF filters
define, any accept-side accelerator must complement the BPF filters
rather than replace them.

Statically analyzing BPF bytecode to determine whether each syscall
will always land in allow or reject is more of a rabbit hole,
especially since there is no current in-kernel infrastructure to
enumerate all the possible architecture numbers for a given machine.
So rather than doing that, we propose to cache the results after the
BPF filters are run. And since there are filters, like docker's, that
check the arguments of some syscalls but not others, when a filter is
loaded we analyze it by following its control flow graph to find
whether each syscall is cacheable (i.e. the filter does not access the
syscall arguments or instruction pointer for it), and store the result
for each filter in a bitmap. Changes to the architecture number or the
filter are expected to be rare and simply cause the cache to be
cleared. This solution is fully transparent to userspace.

Ongoing work is to further support arguments with fast hash table
lookups. We are investigating the performance of doing so [6], and how
to best integrate it with the existing seccomp infrastructure.

We have done some benchmarks with the patches applied against bpf-next
commit 2e80be60c465 ("libbpf: Fix compilation warnings for 64-bit
printf args"). Measured by me, in a qemu-kvm x86_64 VM, on an Intel(R)
Core(TM) i5-8250U CPU @ 1.60GHz, average results:

Without cache, seccomp_benchmark:

Current BPF sysctl settings:
net.core.bpf_jit_enable = 1
net.core.bpf_jit_harden = 0
Calibrating sample size for 15 seconds worth of syscalls ...
Benchmarking 23486415 syscalls...
16.079642020 - 1.013345439 = 15066296581 (15.1s)
getpid native: 641 ns
32.080237410 - 16.080763500 = 15999473910 (16.0s)
getpid RET_ALLOW 1 filter: 681 ns
48.609461618 - 32.081296173 = 16528165445 (16.5s)
getpid RET_ALLOW 2 filters: 703 ns
Estimated total seccomp overhead for 1 filter: 40 ns
Estimated total seccomp overhead for 2 filters: 62 ns
Estimated seccomp per-filter overhead: 22 ns
Estimated seccomp entry overhead: 18 ns

With cache:

Current BPF sysctl settings:
net.core.bpf_jit_enable = 1
net.core.bpf_jit_harden = 0
Calibrating sample size for 15 seconds worth of syscalls ...
Benchmarking 23486415 syscalls...
16.059512499 - 1.014108434 = 15045404065 (15.0s)
getpid native: 640 ns
31.651075934 - 16.060637323 = 15590438611 (15.6s)
getpid RET_ALLOW 1 filter: 663 ns
47.367316169 - 31.652302661 = 15715013508 (15.7s)
getpid RET_ALLOW 2 filters: 669 ns
Estimated total seccomp overhead for 1 filter: 23 ns
Estimated total seccomp overhead for 2 filters: 29 ns
Estimated seccomp per-filter overhead: 6 ns
Estimated seccomp entry overhead: 17 ns

Depending on the run, the estimated seccomp overhead for 2 filters can
be less than the overhead for 1 filter, resulting in an underflow in
the estimated per-filter overhead:

Estimated total seccomp overhead for 1 filter: 27 ns
Estimated total seccomp overhead for 2 filters: 21 ns
Estimated seccomp per-filter overhead: 18446744073709551610 ns
Estimated seccomp entry overhead: 33 ns

Jack Chen has also run some benchmarks on a bare-metal Intel(R)
Xeon(R) CPU E3-1240 v3 @ 3.40GHz, with side channel mitigations off
(spec_store_bypass_disable=off spectre_v2=off mds=off pti=off
l1tf=off), with BPF JIT on and the docker default profile, and
reported:

unixbench syscall mix (https://github.com/kdlucas/byte-unixbench)
unconfined: 33295685
docker default: 20661056 60%
docker default + cache: 25719937 30%

Patch 1 introduces the static analyzer to check, for a given filter,
whether the CFG loads the syscall
arguments for each syscall number. Patch 2 implements the bitmap
cache.

[1] https://lore.kernel.org/linux-security-module/c22a6c3cefc2412cad00ae14c1371711@huawei.com/T/
[2] https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/T/
[3] https://github.com/seccomp/libseccomp/issues/116
[4] https://github.com/moby/moby/blob/ae0ef82b90356ac613f329a8ef5ee42ca923417d/profiles/seccomp/default.json
[5] https://github.com/systemd/systemd/blob/6743a1caf4037f03dc51a1277855018e4ab61957/src/shared/seccomp-util.c#L270
[6] Draco: Architectural and Operating System Support for System Call
    Security. https://tianyin.github.io/pub/draco.pdf, MICRO-53, Oct. 2020

YiFei Zhu (2):
  seccomp/cache: Add "emulator" to check if filter is arg-dependent
  seccomp/cache: Cache filter results that allow syscalls

 arch/x86/Kconfig        |  27 +++
 include/linux/seccomp.h |  22 +++
 kernel/seccomp.c        | 400 +++++++++++++++++++++++++++++++++++++++-
 3 files changed, 446 insertions(+), 3 deletions(-)

--
2.28.0

^ permalink raw reply	[flat|nested] 149+ messages in thread
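The allow-bitmap fast path the cover letter describes can be sketched in
userspace C roughly as follows. This is a minimal sketch under stated
assumptions: `toy_filter_cache`, the helper names, and the syscall count
are illustrative, not the kernel's actual names or layout; the real
cache lives in `struct seccomp_filter` and is consulted on syscall entry
before any cBPF program runs.

```c
#include <stdbool.h>
#include <string.h>

/* Illustrative syscall count; the real NR_syscalls is per-architecture. */
#define TOY_NR_SYSCALLS 448
#define TOY_BPL         (8 * (int)sizeof(unsigned long))

/* Per-filter cache: bit nr set => the filter always allows syscall nr. */
struct toy_filter_cache {
	unsigned long allow[(TOY_NR_SYSCALLS + TOY_BPL - 1) / TOY_BPL];
};

/* Fast path: if the bit is set, the cBPF filter need not run at all. */
static bool toy_cache_allows(const struct toy_filter_cache *c, int nr)
{
	if (nr < 0 || nr >= TOY_NR_SYSCALLS)
		return false; /* out of range: always take the slow path */
	return (c->allow[nr / TOY_BPL] >> (nr % TOY_BPL)) & 1;
}

/* Filled in at filter-load time by the static analysis of patch 1. */
static void toy_cache_mark_allowed(struct toy_filter_cache *c, int nr)
{
	if (nr < 0 || nr >= TOY_NR_SYSCALLS)
		return;
	c->allow[nr / TOY_BPL] |= 1UL << (nr % TOY_BPL);
}
```

A bit that is clear only means "run the filters as before", so a
too-conservative analysis costs performance but never correctness.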
* [RFC PATCH seccomp 1/2] seccomp/cache: Add "emulator" to check if filter is arg-dependent 2020-09-21 5:35 [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls YiFei Zhu @ 2020-09-21 5:35 ` YiFei Zhu 2020-09-21 17:47 ` Jann Horn 2020-09-21 5:35 ` [RFC PATCH seccomp 2/2] seccomp/cache: Cache filter results that allow syscalls YiFei Zhu ` (7 subsequent siblings) 8 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-09-21 5:35 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, Andrea Arcangeli, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Valentin Rothberg From: YiFei Zhu <yifeifz2@illinois.edu> SECCOMP_CACHE_NR_ONLY will only operate on syscalls that do not access any syscall arguments or instruction pointer. To facilitate this we need a static analyser to know whether a filter will access. This is implemented here with a pseudo-emulator, and stored in a per-filter bitmap. Each seccomp cBPF instruction, aside from ALU (which should rarely be used in seccomp), gets a naive best-effort emulation for each syscall number. The emulator works by following all possible (without SAT solving) paths the filter can take. Every cBPF register / memory position records whether that is a constant, and of so, the value of the constant. Loading from struct seccomp_data is considered constant if it is a syscall number, else it is an unknown. For each conditional jump, if the both arguments can be resolved to a constant, the jump is followed after computing the result of the condition; else both directions are followed, by pushing one of the next states to a linked list of next states to process. We keep a finite number of pending states to process. The emulation is halted if it reaches a return, or if it reaches a read from struct seccomp_data that reads an offset that is neither syscall number or architecture number. 
In the latter case, we mark the syscall number as not okay for seccomp to cache. If a filter depends on more filters, then if its dependee cannot process the syscall then the depender is also marked not to process the syscall. We also do a single pass on the entire filter instructions before performing emulation. If none of the filter instructions load from the troublesome offsets, then the filter is considered "trivial", and all syscalls are marked okay for seccomp to cache. Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> --- arch/x86/Kconfig | 27 ++++ kernel/seccomp.c | 323 ++++++++++++++++++++++++++++++++++++++++++++++- 2 files changed, 349 insertions(+), 1 deletion(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 7101ac64bb20..9e6891812053 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -1984,6 +1984,33 @@ config SECCOMP If unsure, say Y. Only embedded should say N here. +choice + prompt "Seccomp filter cache" + default SECCOMP_CACHE_NONE + depends on SECCOMP + depends on SECCOMP_FILTER + help + Seccomp filters can potentially incur large overhead for each + system call. This can alleviate some of the overhead. + + If in doubt, select 'none'. + +config SECCOMP_CACHE_NONE + bool "None" + help + No caching is done. Seccomp filters will be called each time + a system call occurs in a seccomp-guarded task. + +config SECCOMP_CACHE_NR_ONLY + bool "Syscall number only" + help + This is enables a bitmap to cache the results of seccomp + filters, if the filter allows the syscall and is independent + of the syscall arguments. This requires around 60 bytes per + filter and 70 bytes per task. 
+ +endchoice + source "kernel/Kconfig.hz" config KEXEC diff --git a/kernel/seccomp.c b/kernel/seccomp.c index 3ee59ce0a323..d8c30901face 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -143,6 +143,27 @@ struct notification { struct list_head notifications; }; +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY +/** + * struct seccomp_cache_filter_data - container for cache's per-filter data + * + * @syscall_ok: A bitmap where each bit represent whether seccomp is allowed to + * cache the results of this syscall. + */ +struct seccomp_cache_filter_data { + DECLARE_BITMAP(syscall_ok, NR_syscalls); +}; + +#define SECCOMP_EMU_MAX_PENDING_STATES 64 +#else +struct seccomp_cache_filter_data { }; + +static inline int seccomp_cache_prepare(struct seccomp_filter *sfilter) +{ + return 0; +} +#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */ + /** * struct seccomp_filter - container for seccomp BPF programs * @@ -185,6 +206,7 @@ struct seccomp_filter { struct notification *notif; struct mutex notify_lock; wait_queue_head_t wqh; + struct seccomp_cache_filter_data cache; }; /* Limit any path through the tree to 256KB worth of instructions. */ @@ -530,6 +552,297 @@ static inline void seccomp_sync_threads(unsigned long flags) } } +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY +/** + * struct seccomp_emu_env - container for seccomp emulator environment + * + * @filter: The cBPF filter instructions. + * @next_state: The next pending state to start emulating from. + * @next_state_len: Length of the next state linked list. This is used to + * enforce naximum number of pending states. + * @nr: The syscall number we are emulating. + * @syscall_ok: Emulation result, whether it is okay for seccomp to cache the + * syscall. + */ +struct seccomp_emu_env { + struct sock_filter *filter; + struct seccomp_emu_state *next_state; + int next_state_len; + int nr; + bool syscall_ok; +}; + +/** + * struct seccomp_emu_state - container for seccomp emulator state + * + * @next: The next pending state. 
This structure is a linked list. + * @pc: The current program counter. + * @reg_known: Whether each cBPF register / memory location is a constant. + * @reg_const: When a cBPF register / memory location is a constant, the value + * of that constant. + */ +struct seccomp_emu_state { + struct seccomp_emu_state *next; + int pc; + bool reg_known[2 + BPF_MEMWORDS]; + u32 reg_const[2 + BPF_MEMWORDS]; +}; + +/** + * seccomp_emu_step - step one instruction in the emulator + * @env: The emulator environment + * @state: The emulator state + * + * Returns 1 to halt emulation, 0 to continue, or -errno if error occurred. + */ +static int seccomp_emu_step(struct seccomp_emu_env *env, + struct seccomp_emu_state *state) +{ + struct sock_filter *ftest = &env->filter[state->pc++]; + struct seccomp_emu_state *new_state; + u16 code = ftest->code; + u32 k = ftest->k; + u32 operand; + bool compare; + int reg_idx; + + switch (BPF_CLASS(code)) { + case BPF_LD: + case BPF_LDX: + reg_idx = BPF_CLASS(code) == BPF_LDX; + + switch (BPF_MODE(code)) { + case BPF_IMM: + state->reg_known[reg_idx] = true; + state->reg_const[reg_idx] = k; + break; + case BPF_ABS: + if (k == offsetof(struct seccomp_data, nr)) { + state->reg_known[reg_idx] = true; + state->reg_const[reg_idx] = env->nr; + } else { + state->reg_known[reg_idx] = false; + + if (k != offsetof(struct seccomp_data, arch)) { + env->syscall_ok = false; + return 1; + } + } + + break; + case BPF_MEM: + state->reg_known[reg_idx] = state->reg_known[2 + k]; + state->reg_const[reg_idx] = state->reg_const[2 + k]; + break; + default: + state->reg_known[reg_idx] = false; + } + + return 0; + case BPF_ST: + case BPF_STX: + reg_idx = BPF_CLASS(code) == BPF_STX; + + state->reg_known[2 + k] = state->reg_known[reg_idx]; + state->reg_const[2 + k] = state->reg_const[reg_idx]; + + return 0; + case BPF_ALU: + state->reg_known[0] = false; + return 0; + case BPF_JMP: + if (BPF_OP(code) == BPF_JA) { + state->pc += k; + return 0; + } + + if (ftest->jt == ftest->jf) { 
+ state->pc += ftest->jt; + return 0; + } + + if (!state->reg_known[0]) + goto both_cases; + + switch (BPF_SRC(code)) { + case BPF_K: + operand = k; + break; + case BPF_X: + if (!state->reg_known[1]) + goto both_cases; + operand = state->reg_const[1]; + break; + default: + WARN_ON(true); + return -EINVAL; + } + + switch (BPF_OP(code)) { + case BPF_JEQ: + compare = state->reg_const[0] == operand; + break; + case BPF_JGT: + compare = state->reg_const[0] > operand; + break; + case BPF_JGE: + compare = state->reg_const[0] >= operand; + break; + case BPF_JSET: + compare = state->reg_const[0] & operand; + break; + default: + WARN_ON(true); + return -EINVAL; + } + + state->pc += compare ? ftest->jt : ftest->jf; + + return 0; + +both_cases: + if (env->next_state_len >= SECCOMP_EMU_MAX_PENDING_STATES) + return -E2BIG; + + new_state = kmalloc(sizeof(*new_state), GFP_KERNEL); + if (!new_state) + return -ENOMEM; + + *new_state = *state; + new_state->next = env->next_state; + new_state->pc += ftest->jt; + env->next_state = new_state; + env->next_state_len++; + + state->pc += ftest->jf; + + return 0; + case BPF_RET: + return 1; + case BPF_MISC: + switch (BPF_MISCOP(code)) { + case BPF_TAX: + state->reg_known[1] = state->reg_known[0]; + state->reg_const[1] = state->reg_const[0]; + break; + case BPF_TXA: + state->reg_known[0] = state->reg_known[1]; + state->reg_const[0] = state->reg_const[1]; + break; + default: + WARN_ON(true); + return -EINVAL; + } + + return 0; + default: + BUILD_BUG(); + unreachable(); + } +} + +/** + * seccomp_cache_filter_trivial - check if the program does not load arguments. + * @fprog: The cBPF program code + * + * Returns true if the filter is trivial. 
+ */ +static bool seccomp_cache_filter_trivial(struct sock_fprog_kern *fprog) +{ + int pc; + + for (pc = 0; pc < fprog->len; pc++) { + struct sock_filter *ftest = &fprog->filter[pc]; + u16 code = ftest->code; + u32 k = ftest->k; + + if (BPF_CLASS(code) == BPF_LD && BPF_MODE(code) == BPF_ABS) { + if (k != offsetof(struct seccomp_data, nr) && + k != offsetof(struct seccomp_data, arch)) + return false; + } + } + + return true; +} + +/** + * seccomp_cache_prepare - emulate the filter to find cachable syscalls + * @sfilter: The seccomp filter + * + * Returns 0 if successful or -errno if error occurred. + */ +static int seccomp_cache_prepare(struct seccomp_filter *sfilter) +{ + struct sock_fprog_kern *fprog = sfilter->prog->orig_prog; + struct seccomp_filter *prev = sfilter->prev; + struct sock_filter *filter = fprog->filter; + struct seccomp_emu_state *state; + int nr, res = 0; + + if (seccomp_cache_filter_trivial(fprog)) { + if (prev) + bitmap_copy(sfilter->cache.syscall_ok, + prev->cache.syscall_ok, NR_syscalls); + else + bitmap_fill(sfilter->cache.syscall_ok, NR_syscalls); + + return 0; + } + + for (nr = 0; nr < NR_syscalls; nr++) { + struct seccomp_emu_env env = {0}; + + env.syscall_ok = !prev || test_bit(nr, prev->cache.syscall_ok); + if (!env.syscall_ok) + continue; + + env.filter = filter; + env.nr = nr; + + env.next_state = kzalloc(sizeof(*env.next_state), GFP_KERNEL); + env.next_state_len = 1; + if (!env.next_state) + return -ENOMEM; + + while (env.next_state) { + state = env.next_state; + env.next_state = state->next; + env.next_state_len--; + + while (true) { + res = seccomp_emu_step(&env, state); + + if (res) + break; + } + + kfree(state); + + if (res < 0) + goto free_states; + } + +free_states: + while (env.next_state) { + state = env.next_state; + env.next_state = state->next; + + kfree(state); + } + + if (res < 0) + goto out; + + if (env.syscall_ok) + set_bit(nr, sfilter->cache.syscall_ok); + } + +out: + return res; +} +#endif /* 
CONFIG_SECCOMP_CACHE_NR_ONLY */ + /** * seccomp_prepare_filter: Prepares a seccomp filter for use. * @fprog: BPF program to install @@ -540,7 +853,8 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog) { struct seccomp_filter *sfilter; int ret; - const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE); + const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE) || + IS_ENABLED(CONFIG_SECCOMP_CACHE_NR_ONLY); if (fprog->len == 0 || fprog->len > BPF_MAXINSNS) return ERR_PTR(-EINVAL); @@ -571,6 +885,13 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog) return ERR_PTR(ret); } + ret = seccomp_cache_prepare(sfilter); + if (ret < 0) { + bpf_prog_destroy(sfilter->prog); + kfree(sfilter); + return ERR_PTR(ret); + } + refcount_set(&sfilter->refs, 1); refcount_set(&sfilter->users, 1); init_waitqueue_head(&sfilter->wqh); -- 2.28.0 ^ permalink raw reply related [flat|nested] 149+ messages in thread
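The single-pass "trivial filter" check from patch 1 can be illustrated
with a small userspace sketch. The `toy_*` names are made up for
illustration; the kernel code operates on `struct sock_filter` with the
real `BPF_*` macros, but the opcode encodings and the `seccomp_data`
offsets below match the UAPI values.

```c
#include <stdbool.h>
#include <stdint.h>

/* Minimal cBPF definitions, mirroring <linux/bpf_common.h>. */
struct toy_sock_filter {
	uint16_t code;
	uint8_t jt, jf;
	uint32_t k;
};

#define TOY_BPF_CLASS(code) ((code) & 0x07)
#define TOY_BPF_MODE(code)  ((code) & 0xe0)
#define TOY_BPF_LD   0x00
#define TOY_BPF_ABS  0x20

/* UAPI-stable offsets into struct seccomp_data. */
#define TOY_SD_NR    0
#define TOY_SD_ARCH  4

/*
 * Single pass over the program: if no instruction loads a seccomp_data
 * field other than the syscall number or the arch, the filter's result
 * depends only on (arch, nr), so every allowed syscall is cacheable
 * without running the per-syscall emulation at all.
 */
static bool toy_filter_is_trivial(const struct toy_sock_filter *f, int len)
{
	for (int pc = 0; pc < len; pc++) {
		uint16_t code = f[pc].code;

		if (TOY_BPF_CLASS(code) == TOY_BPF_LD &&
		    TOY_BPF_MODE(code) == TOY_BPF_ABS &&
		    f[pc].k != TOY_SD_NR && f[pc].k != TOY_SD_ARCH)
			return false;
	}
	return true;
}
```

For example, a two-instruction filter `ld [0]; ret #k` is trivial, while
one beginning `ld [16]` (the low word of `args[0]`) is not.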
* Re: [RFC PATCH seccomp 1/2] seccomp/cache: Add "emulator" to check if filter is arg-dependent 2020-09-21 5:35 ` [RFC PATCH seccomp 1/2] seccomp/cache: Add "emulator" to check if filter is arg-dependent YiFei Zhu @ 2020-09-21 17:47 ` Jann Horn 2020-09-21 18:38 ` Jann Horn 2020-09-21 23:44 ` YiFei Zhu 0 siblings, 2 replies; 149+ messages in thread From: Jann Horn @ 2020-09-21 17:47 UTC (permalink / raw) To: YiFei Zhu Cc: Linux Containers, YiFei Zhu, bpf, Andrea Arcangeli, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Valentin Rothberg, Andy Lutomirski, Will Drewry, Jann Horn, Aleksa Sarai, kernel list On Mon, Sep 21, 2020 at 7:35 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > SECCOMP_CACHE_NR_ONLY will only operate on syscalls that do not > access any syscall arguments or instruction pointer. To facilitate > this we need a static analyser to know whether a filter will > access. This is implemented here with a pseudo-emulator, and > stored in a per-filter bitmap. Each seccomp cBPF instruction, > aside from ALU (which should rarely be used in seccomp), gets a > naive best-effort emulation for each syscall number. > > The emulator works by following all possible (without SAT solving) > paths the filter can take. Every cBPF register / memory position > records whether that is a constant, and of so, the value of the > constant. Loading from struct seccomp_data is considered constant > if it is a syscall number, else it is an unknown. For each > conditional jump, if the both arguments can be resolved to a > constant, the jump is followed after computing the result of the > condition; else both directions are followed, by pushing one of > the next states to a linked list of next states to process. We > keep a finite number of pending states to process. Is this actually necessary, or can we just bail out on any branch that we can't statically resolve? 
struct seccomp_data only contains the syscall number (constant for a given filter evaluation), the architecture number (also constant), the instruction pointer (basically never used in seccomp filters), and the syscall arguments. Any normal seccomp filter first branches on the architecture, then branches on the syscall number, and then branches on arguments if necessary. This optimization could only be improved by the "follow both branches" logic if a seccomp program branches on either the instruction pointer or an argument *before* looking at the syscall number, and later comes to the same conclusion on *both* sides of the check. It would have to be something like: if (instruction_pointer == 0xasdf1234) { if (nr == mmap) return ACCEPT; [...] return KILL; } else { if (nr == mmap) return ACCEPT; [...] return KILL; } I've never seen anyone do something like this. And the proposed patch would still bail out on such a filter because of the load from the instruction_pointer field; I don't think it would even be possible to reach a branch with an unknown condition with this patch. So I think we should probably get rid of this extra logic for keeping track of multiple execution states for now. That would make the code a lot simpler. Also: If it turns out that the time spent in seccomp_cache_prepare() is measurable for large filters, a possible improvement would be to keep track of the last syscall number for which the result would be the same as for the current one, such that instead of evaluating the filter for one instruction at a time, it would effectively be evaluated for a range at a time. That should be pretty straightforward to implement, I think. > The emulation is halted if it reaches a return, or if it reaches a > read from struct seccomp_data that reads an offset that is neither > syscall number or architecture number. In the latter case, we mark > the syscall number as not okay for seccomp to cache. 
If a filter > depends on more filters, then if its dependee cannot process the > syscall then the depender is also marked not to process the syscall. > > We also do a single pass on the entire filter instructions before > performing emulation. If none of the filter instructions load from > the troublesome offsets, then the filter is considered "trivial", > and all syscalls are marked okay for seccomp to cache. > > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> > --- > arch/x86/Kconfig | 27 ++++ > kernel/seccomp.c | 323 ++++++++++++++++++++++++++++++++++++++++++++++- > 2 files changed, 349 insertions(+), 1 deletion(-) > > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig [...] > +choice > + prompt "Seccomp filter cache" > + default SECCOMP_CACHE_NONE I think this should be on by default. > + depends on SECCOMP > + depends on SECCOMP_FILTER SECCOMP_FILTER already depends on SECCOMP, so the "depends on SECCOMP" line is unnecessary. > + help > + Seccomp filters can potentially incur large overhead for each > + system call. This can alleviate some of the overhead. > + > + If in doubt, select 'none'. This should not be in arch/x86. Other architectures, such as arm64, should also be able to use this without extra work. > +config SECCOMP_CACHE_NONE > + bool "None" > + help > + No caching is done. Seccomp filters will be called each time > + a system call occurs in a seccomp-guarded task. > + > +config SECCOMP_CACHE_NR_ONLY > + bool "Syscall number only" > + help > + This is enables a bitmap to cache the results of seccomp > + filters, if the filter allows the syscall and is independent > + of the syscall arguments. Maybe reword this as something like: "For each syscall number, if the seccomp filter has a fixed result, store that result in a bitmap to speed up system calls." > This requires around 60 bytes per > + filter and 70 bytes per task. 
> + > +endchoice > + > source "kernel/Kconfig.hz" > > config KEXEC > diff --git a/kernel/seccomp.c b/kernel/seccomp.c > index 3ee59ce0a323..d8c30901face 100644 > --- a/kernel/seccomp.c > +++ b/kernel/seccomp.c > @@ -143,6 +143,27 @@ struct notification { > struct list_head notifications; > }; > > +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY > +/** > + * struct seccomp_cache_filter_data - container for cache's per-filter data > + * > + * @syscall_ok: A bitmap where each bit represent whether seccomp is allowed to nit: represents > + * cache the results of this syscall. > + */ > +struct seccomp_cache_filter_data { > + DECLARE_BITMAP(syscall_ok, NR_syscalls); > +}; > + > +#define SECCOMP_EMU_MAX_PENDING_STATES 64 > +#else > +struct seccomp_cache_filter_data { }; > + > +static inline int seccomp_cache_prepare(struct seccomp_filter *sfilter) > +{ > + return 0; > +} > +#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */ [...] > +/** > + * seccomp_emu_step - step one instruction in the emulator > + * @env: The emulator environment > + * @state: The emulator state > + * > + * Returns 1 to halt emulation, 0 to continue, or -errno if error occurred. > + */ > +static int seccomp_emu_step(struct seccomp_emu_env *env, > + struct seccomp_emu_state *state) > +{ > + struct sock_filter *ftest = &env->filter[state->pc++]; > + struct seccomp_emu_state *new_state; > + u16 code = ftest->code; > + u32 k = ftest->k; > + u32 operand; > + bool compare; > + int reg_idx; > + > + switch (BPF_CLASS(code)) { > + case BPF_LD: > + case BPF_LDX: > + reg_idx = BPF_CLASS(code) == BPF_LDX; > + > + switch (BPF_MODE(code)) { > + case BPF_IMM: > + state->reg_known[reg_idx] = true; > + state->reg_const[reg_idx] = k; > + break; > + case BPF_ABS: > + if (k == offsetof(struct seccomp_data, nr)) { > + state->reg_known[reg_idx] = true; > + state->reg_const[reg_idx] = env->nr; > + } else { > + state->reg_known[reg_idx] = false; This is completely broken. 
This emulation logic *needs* to run with the proper architecture identifier. (And for platforms like x86-64 that have compatibility support for a second ABI, the emulation should probably also be done for that ABI, and there should be separate bitmasks for that ABI.) With the current logic, you will (almost) never actually have permitted syscalls in the bitmask, because filters fundamentally have to return different results for different ABIs - the syscall numbers mean completely different things under different ABIs. > + if (k != offsetof(struct seccomp_data, arch)) { > + env->syscall_ok = false; > + return 1; > + } > + } This would read nicer as: if (k == offsetof(struct seccomp_data, nr)) { } else if (k == offsetof(struct seccomp_data, arch)) { } else { env->syscall_ok = false; return 1; } > + > + break; > + case BPF_MEM: > + state->reg_known[reg_idx] = state->reg_known[2 + k]; > + state->reg_const[reg_idx] = state->reg_const[2 + k]; > + break; > + default: > + state->reg_known[reg_idx] = false; > + } > + > + return 0; > + case BPF_ST: > + case BPF_STX: > + reg_idx = BPF_CLASS(code) == BPF_STX; > + > + state->reg_known[2 + k] = state->reg_known[reg_idx]; > + state->reg_const[2 + k] = state->reg_const[reg_idx]; I think we should probably just bail out if we see anything that's BPF_ST/BPF_STX. I've never seen seccomp filters that actually use that part of cBPF. But in case we do need this, maybe instead of using "2 +" for all these things, the cBPF memory slots should be in a separate array. > + return 0; > + case BPF_ALU: > + state->reg_known[0] = false; > + return 0; > + case BPF_JMP: > + if (BPF_OP(code) == BPF_JA) { > + state->pc += k; > + return 0; > + } > + > + if (ftest->jt == ftest->jf) { > + state->pc += ftest->jt; > + return 0; > + } Why is this check here? Is anyone actually creating filters with such obviously nonsensical branches? 
I know that there are highly ludicrous filters out there, but I don't think I've ever seen this specific kind of useless code. > + if (!state->reg_known[0]) > + goto both_cases; [...] > +both_cases: > + if (env->next_state_len >= SECCOMP_EMU_MAX_PENDING_STATES) > + return -E2BIG; Even if we cap the maximum number of pending states, this could still run for an almost unbounded amount of time, I think. Which is bad. If this code was actually necessary, we'd probably want to track separately the total number of branches we've seen and so on. But as I said, I think this code should just be removed instead. [...] > + } > +} [...] ^ permalink raw reply [flat|nested] 149+ messages in thread
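The simplification suggested above — evaluate the filter for a fixed
(arch, nr) pair and bail out as "not cacheable" on the first load that
cannot be resolved — could look roughly like this userspace sketch. It
supports only a toy opcode subset; the opcode encodings, the
`seccomp_data` offsets, `SECCOMP_RET_ALLOW`, and `AUDIT_ARCH_X86_64`
match the UAPI values, but everything else (names, the example filter)
is illustrative, not the kernel implementation.

```c
#include <stdint.h>

struct toy_insn {
	uint16_t code;
	uint8_t jt, jf;
	uint32_t k;
};

#define TOY_LD_ABS 0x20	/* BPF_LD | BPF_W | BPF_ABS */
#define TOY_JEQ_K  0x15	/* BPF_JMP | BPF_JEQ | BPF_K */
#define TOY_JA     0x05	/* BPF_JMP | BPF_JA */
#define TOY_RET_K  0x06	/* BPF_RET | BPF_K */

#define TOY_SD_NR    0	/* offsetof(struct seccomp_data, nr) */
#define TOY_SD_ARCH  4	/* offsetof(struct seccomp_data, arch) */

#define TOY_SECCOMP_RET_ALLOW 0x7fff0000u
#define TOY_AUDIT_ARCH_X86_64 0xc000003eu

/*
 * Evaluate the filter for a fixed (arch, nr).  Every loaded value is a
 * known constant, so no branch ever needs to be forked: the first load
 * of any other seccomp_data field, or any unsupported instruction,
 * simply bails out.  Returns the filter's u32 return value, or -1 for
 * "result not cacheable for this (arch, nr)".
 */
static int64_t toy_eval_fixed(const struct toy_insn *f, int len,
			      uint32_t arch, uint32_t nr)
{
	uint32_t acc = 0;

	for (int pc = 0; pc < len; pc++) {
		const struct toy_insn *i = &f[pc];

		switch (i->code) {
		case TOY_LD_ABS:
			if (i->k == TOY_SD_NR)
				acc = nr;
			else if (i->k == TOY_SD_ARCH)
				acc = arch;
			else
				return -1;	/* argument or IP load: bail */
			break;
		case TOY_JA:
			pc += i->k;
			break;
		case TOY_JEQ_K:
			pc += (acc == i->k) ? i->jt : i->jf;
			break;
		case TOY_RET_K:
			return i->k;
		default:
			return -1;		/* unsupported: bail */
		}
	}
	return -1;	/* fell off the end: treat as not cacheable */
}

/* Example: "allow getpid (nr 39 on x86_64), kill everything else". */
static const struct toy_insn toy_example[] = {
	{TOY_LD_ABS, 0, 0, TOY_SD_ARCH},
	{TOY_JEQ_K,  0, 3, TOY_AUDIT_ARCH_X86_64},
	{TOY_LD_ABS, 0, 0, TOY_SD_NR},
	{TOY_JEQ_K,  0, 1, 39},
	{TOY_RET_K,  0, 0, TOY_SECCOMP_RET_ALLOW},
	{TOY_RET_K,  0, 0, 0},	/* SECCOMP_RET_KILL_THREAD */
};
```

Running `toy_eval_fixed` once per (arch, nr) yields the allow bitmap
directly, with strictly bounded work per syscall number.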
* Re: [RFC PATCH seccomp 1/2] seccomp/cache: Add "emulator" to check if filter is arg-dependent 2020-09-21 17:47 ` Jann Horn @ 2020-09-21 18:38 ` Jann Horn 2020-09-21 23:44 ` YiFei Zhu 1 sibling, 0 replies; 149+ messages in thread From: Jann Horn @ 2020-09-21 18:38 UTC (permalink / raw) To: YiFei Zhu Cc: Linux Containers, YiFei Zhu, bpf, Andrea Arcangeli, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Valentin Rothberg, Andy Lutomirski, Will Drewry, Aleksa Sarai, kernel list On Mon, Sep 21, 2020 at 7:47 PM Jann Horn <jannh@google.com> wrote: > On Mon, Sep 21, 2020 at 7:35 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > > SECCOMP_CACHE_NR_ONLY will only operate on syscalls that do not > > access any syscall arguments or instruction pointer. To facilitate > > this we need a static analyser to know whether a filter will > > access. This is implemented here with a pseudo-emulator, and > > stored in a per-filter bitmap. Each seccomp cBPF instruction, > > aside from ALU (which should rarely be used in seccomp), gets a > > naive best-effort emulation for each syscall number. > > > > The emulator works by following all possible (without SAT solving) > > paths the filter can take. Every cBPF register / memory position > > records whether that is a constant, and of so, the value of the > > constant. Loading from struct seccomp_data is considered constant > > if it is a syscall number, else it is an unknown. For each > > conditional jump, if the both arguments can be resolved to a > > constant, the jump is followed after computing the result of the > > condition; else both directions are followed, by pushing one of > > the next states to a linked list of next states to process. We > > keep a finite number of pending states to process. > > Is this actually necessary, or can we just bail out on any branch that > we can't statically resolve? Aaaah, now I get what's going on. 
You statically compute a bitmask that says whether a given syscall number always has a fixed result *per architecture number*, and then use that later to decide whether results can be cached for the combination of a specific seccomp filter and a specific architecture number. Which mostly works, except that it means you end up with weird per-thread caches and you get interference between ABIs (so if a process e.g. filters the argument numbers for syscall 123 in ABI 1, the results for syscall 123 in ABI 2 also can't be cached). Anyway, even though this works, I think it's the wrong way to go about it. ^ permalink raw reply [flat|nested] 149+ messages in thread
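Separate per-ABI bitmaps, as suggested above, would remove the cross-ABI
interference: caching would be keyed by (arch, nr) rather than nr alone.
A sketch of that shape (all names and sizes illustrative; not the
eventual kernel implementation):

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_NR_SYSCALLS 448	/* illustrative */
#define TOY_MAX_ARCHES  2	/* e.g. native + compat on x86-64 */
#define TOY_BPL         (8 * (int)sizeof(unsigned long))

/* One allow-bitmap per ABI the filter can be entered with. */
struct toy_arch_cache {
	uint32_t arch;		/* AUDIT_ARCH_* identifier */
	unsigned long allow[(TOY_NR_SYSCALLS + TOY_BPL - 1) / TOY_BPL];
};

struct toy_cache {
	struct toy_arch_cache per_arch[TOY_MAX_ARCHES];
};

/* Keyed by (arch, nr): filtering the arguments of nr in one ABI no
 * longer prevents caching the same nr in the other ABI. */
static bool toy_arch_cache_allows(const struct toy_cache *c,
				  uint32_t arch, int nr)
{
	for (int i = 0; i < TOY_MAX_ARCHES; i++) {
		if (c->per_arch[i].arch != arch)
			continue;
		if (nr < 0 || nr >= TOY_NR_SYSCALLS)
			return false;
		return (c->per_arch[i].allow[nr / TOY_BPL] >>
			(nr % TOY_BPL)) & 1;
	}
	return false;	/* unknown ABI: slow path */
}

static void toy_arch_cache_mark(struct toy_cache *c, uint32_t arch, int nr)
{
	if (nr < 0 || nr >= TOY_NR_SYSCALLS)
		return;
	for (int i = 0; i < TOY_MAX_ARCHES; i++)
		if (c->per_arch[i].arch == arch)
			c->per_arch[i].allow[nr / TOY_BPL] |=
				1UL << (nr % TOY_BPL);
}
```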
* Re: [RFC PATCH seccomp 1/2] seccomp/cache: Add "emulator" to check if filter is arg-dependent 2020-09-21 17:47 ` Jann Horn 2020-09-21 18:38 ` Jann Horn @ 2020-09-21 23:44 ` YiFei Zhu 2020-09-22 0:25 ` Jann Horn 1 sibling, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-09-21 23:44 UTC (permalink / raw) To: Jann Horn Cc: Linux Containers, YiFei Zhu, bpf, Andrea Arcangeli, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Valentin Rothberg, Andy Lutomirski, Will Drewry, Aleksa Sarai, kernel list On Mon, Sep 21, 2020 at 12:47 PM Jann Horn <jannh@google.com> wrote: > Is this actually necessary, or can we just bail out on any branch that > we can't statically resolve? I think that would make much more sense after we enumerate the arch numbers, since if there is still a branch once the arch and syscall numbers are fixed, we can assume that the return values will differ depending on which side of the branch is followed. > Also: If it turns out that the time spent in seccomp_cache_prepare() > is measurable for large filters, a possible improvement would be to > keep track of the last syscall number for which the result would be > the same as for the current one, such that instead of evaluating the > filter for one instruction at a time, it would effectively be > evaluated for a range at a time. That should be pretty straightforward > to implement, I think. My concern was more about the possibly-exponential amount of time & memory needed to evaluate an adversarial filter full of unresolvable branches, hence the cap on pending states. If we never follow both branches then evaluation should not be much of a concern. > > + depends on SECCOMP > > + depends on SECCOMP_FILTER > > SECCOMP_FILTER already depends on SECCOMP, so the "depends on SECCOMP" > line is unnecessary. The reason this is here is how it looks in menuconfig. 
SECCOMP is the direct previous entry, so if this depends on SECCOMP then the config would be indented. Is this appearance not worth keeping, or is there some better way to do this? > > + help > > + Seccomp filters can potentially incur large overhead for each > > + system call. This can alleviate some of the overhead. > > + > > + If in doubt, select 'none'. > > This should not be in arch/x86. Other architectures, such as arm64, > should also be able to use this without extra work. In the initial RFC patch I only added it to x86. I could add it to any arch that has seccomp filters. Though, I'm wondering, why is SECCOMP in the arch-specific Kconfigs? > I think we should probably just bail out if we see anything that's > BPF_ST/BPF_STX. I've never seen seccomp filters that actually use that > part of cBPF. > > But in case we do need this, maybe instead of using "2 +" for all > these things, the cBPF memory slots should be in a separate array. OK, I'll just bail. YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [RFC PATCH seccomp 1/2] seccomp/cache: Add "emulator" to check if filter is arg-dependent 2020-09-21 23:44 ` YiFei Zhu @ 2020-09-22 0:25 ` Jann Horn 2020-09-22 0:47 ` YiFei Zhu 0 siblings, 1 reply; 149+ messages in thread From: Jann Horn @ 2020-09-22 0:25 UTC (permalink / raw) To: YiFei Zhu, Kees Cook Cc: Linux Containers, YiFei Zhu, bpf, Andrea Arcangeli, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Valentin Rothberg, Andy Lutomirski, Will Drewry, Aleksa Sarai, kernel list On Tue, Sep 22, 2020 at 1:44 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > On Mon, Sep 21, 2020 at 12:47 PM Jann Horn <jannh@google.com> wrote: > > > + depends on SECCOMP > > > + depends on SECCOMP_FILTER > > > > SECCOMP_FILTER already depends on SECCOMP, so the "depends on SECCOMP" > > line is unnecessary. > > The reason that this is here is because of the looks in menuconfig. > SECCOMP is the direct previous entry, so if this depends on SECCOMP > then the config would be indented. Is this looks not worth keeping or > is there some better way to do this? Ah, I didn't realize this. > > > + help > > > + Seccomp filters can potentially incur large overhead for each > > > + system call. This can alleviate some of the overhead. > > > + > > > + If in doubt, select 'none'. > > > > This should not be in arch/x86. Other architectures, such as arm64, > > should also be able to use this without extra work. > > In the initial RFC patch I only added to x86. I could add it to any > arch that has seccomp filters. Though, I'm wondering, why is SECCOMP > in the arch-specific Kconfigs? Ugh, yeah, the existing code is already bad... as far as I can tell, SECCOMP shouldn't be there, and instead the arch-specific Kconfig should define something like HAVE_ARCH_SECCOMP and then arch/Kconfig would define SECCOMP and let it depend on HAVE_ARCH_SECCOMP. 
It's really gross how the SECCOMP config description has been copypasted into a dozen different Kconfig files; and looking around a bit, you can actually see that e.g. s390 has an utterly outdated help text which still claims that seccomp is controlled via the ancient "/proc/<pid>/seccomp". I guess this very nicely illustrates why putting such options into arch-specific Kconfig is a bad idea. :P ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [RFC PATCH seccomp 1/2] seccomp/cache: Add "emulator" to check if filter is arg-dependent 2020-09-22 0:25 ` Jann Horn @ 2020-09-22 0:47 ` YiFei Zhu 0 siblings, 0 replies; 149+ messages in thread From: YiFei Zhu @ 2020-09-22 0:47 UTC (permalink / raw) To: Jann Horn Cc: Kees Cook, Linux Containers, YiFei Zhu, bpf, Andrea Arcangeli, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Valentin Rothberg, Andy Lutomirski, Will Drewry, Aleksa Sarai, kernel list On Mon, Sep 21, 2020 at 7:26 PM Jann Horn <jannh@google.com> wrote: > > In the initial RFC patch I only added to x86. I could add it to any > > arch that has seccomp filters. Though, I'm wondering, why is SECCOMP > > in the arch-specific Kconfigs? > > Ugh, yeah, the existing code is already bad... as far as I can tell, > SECCOMP shouldn't be there, and instead the arch-specific Kconfig > should define something like HAVE_ARCH_SECCOMP and then arch/Kconfig > would define SECCOMP and let it depend on HAVE_ARCH_SECCOMP. It's > really gross how the SECCOMP config description has been copypasted > into a dozen different Kconfig files; and looking around a bit, you > can actually see that e.g. s390 has an utterly outdated help text > which still claims that seccomp is controlled via the ancient > "/proc/<pid>/seccomp". I guess this very nicely illustrates why > putting such options into arch-specific Kconfig is a bad idea. :P Ah, time to fix this then. YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
* [RFC PATCH seccomp 2/2] seccomp/cache: Cache filter results that allow syscalls 2020-09-21 5:35 [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls YiFei Zhu 2020-09-21 5:35 ` [RFC PATCH seccomp 1/2] seccomp/cache: Add "emulator" to check if filter is arg-dependent YiFei Zhu @ 2020-09-21 5:35 ` YiFei Zhu 2020-09-21 18:08 ` Jann Horn 2020-09-25 0:01 ` [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array Kees Cook 2020-09-21 5:48 ` [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls Sargun Dhillon ` (6 subsequent siblings) 8 siblings, 2 replies; 149+ messages in thread From: YiFei Zhu @ 2020-09-21 5:35 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, Andrea Arcangeli, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Valentin Rothberg From: YiFei Zhu <yifeifz2@illinois.edu> The fast (common) path for seccomp should be that the filter permits the syscall to pass through, and failing seccomp is expected to be an exceptional case; it is not expected for userspace to call a denylisted syscall over and over. We do this by creating a per-task bitmap of permitted syscalls. If the seccomp filter is invoked we check if it is cached and if so directly return allow. Else we call into the cBPF filter, and if the result is an allow then we cache the results. The cache is per-task to minimize thread-synchronization issues in the hot path of cache lookup, and to avoid different architecture numbers sharing the same cache. To account for one thread changing the filter for another thread of the same process, the per-task struct also contains a pointer to the filter the cache is built on. When the cache lookup uses a different filter than the last lookup, the per-task cache bitmap is cleared. 
An architecture number change also clears the per-task cache, since it should be very unlikely for a given thread to change its architecture. Benchmark results, on qemu-kvm x86_64 VM, on Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz, with seccomp_benchmark: With SECCOMP_CACHE_NONE: Current BPF sysctl settings: net.core.bpf_jit_enable = 1 net.core.bpf_jit_harden = 0 Calibrating sample size for 15 seconds worth of syscalls ... Benchmarking 23486415 syscalls... 16.079642020 - 1.013345439 = 15066296581 (15.1s) getpid native: 641 ns 32.080237410 - 16.080763500 = 15999473910 (16.0s) getpid RET_ALLOW 1 filter: 681 ns 48.609461618 - 32.081296173 = 16528165445 (16.5s) getpid RET_ALLOW 2 filters: 703 ns Estimated total seccomp overhead for 1 filter: 40 ns Estimated total seccomp overhead for 2 filters: 62 ns Estimated seccomp per-filter overhead: 22 ns Estimated seccomp entry overhead: 18 ns With SECCOMP_CACHE_NR_ONLY: Current BPF sysctl settings: net.core.bpf_jit_enable = 1 net.core.bpf_jit_harden = 0 Calibrating sample size for 15 seconds worth of syscalls ... Benchmarking 23486415 syscalls... 
16.059512499 - 1.014108434 = 15045404065 (15.0s) getpid native: 640 ns 31.651075934 - 16.060637323 = 15590438611 (15.6s) getpid RET_ALLOW 1 filter: 663 ns 47.367316169 - 31.652302661 = 15715013508 (15.7s) getpid RET_ALLOW 2 filters: 669 ns Estimated total seccomp overhead for 1 filter: 23 ns Estimated total seccomp overhead for 2 filters: 29 ns Estimated seccomp per-filter overhead: 6 ns Estimated seccomp entry overhead: 17 ns Co-developed-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu> Signed-off-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu> Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> --- include/linux/seccomp.h | 22 ++++++++++++ kernel/seccomp.c | 77 +++++++++++++++++++++++++++++++++++++++-- 2 files changed, 97 insertions(+), 2 deletions(-) diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h index 02aef2844c38..08ec8b90c99d 100644 --- a/include/linux/seccomp.h +++ b/include/linux/seccomp.h @@ -21,6 +21,27 @@ #include <asm/seccomp.h> struct seccomp_filter; + +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY +/** + * struct seccomp_cache_task_data - container for seccomp cache's per-task data + * + * @syscall_ok: A bitmap where each bit represents whether the syscall is cached + * and that the filter allowed it. + * @last_filter: If the next cache lookup uses a different filter, the lookup + * will clear cache. + * @last_arch: If the next cache lookup uses a different arch number, the + * lookup will clear cache. 
+ */ +struct seccomp_cache_task_data { + DECLARE_BITMAP(syscall_ok, NR_syscalls); + const struct seccomp_filter *last_filter; + u32 last_arch; +}; +#else +struct seccomp_cache_task_data { }; +#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */ + /** * struct seccomp - the state of a seccomp'ed process * @@ -36,6 +57,7 @@ struct seccomp { int mode; atomic_t filter_count; struct seccomp_filter *filter; + struct seccomp_cache_task_data cache; }; #ifdef CONFIG_HAVE_ARCH_SECCOMP_FILTER diff --git a/kernel/seccomp.c b/kernel/seccomp.c index d8c30901face..7096f8c86f71 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -162,6 +162,17 @@ static inline int seccomp_cache_prepare(struct seccomp_filter *sfilter) { return 0; } + +static inline bool seccomp_cache_check(const struct seccomp_filter *sfilter, + const struct seccomp_data *sd) +{ + return 0; +} + +static inline void seccomp_cache_insert(const struct seccomp_filter *sfilter, + const struct seccomp_data *sd) +{ +} #endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */ /** @@ -316,6 +327,59 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen) return 0; } +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY +/** + * seccomp_cache_check - lookup seccomp cache + * @sfilter: The seccomp filter + * @sd: The seccomp data to lookup the cache with + * + * Returns true if the seccomp_data is cached and allowed. 
+ */ +static bool seccomp_cache_check(const struct seccomp_filter *sfilter, + const struct seccomp_data *sd) +{ + struct seccomp_cache_task_data *thread_data; + int syscall_nr = sd->nr; + + if (unlikely(syscall_nr < 0 || syscall_nr >= NR_syscalls)) + return false; + + thread_data = ¤t->seccomp.cache; + if (unlikely(thread_data->last_filter != sfilter || + thread_data->last_arch != sd->arch)) { + thread_data->last_filter = sfilter; + thread_data->last_arch = sd->arch; + + bitmap_zero(thread_data->syscall_ok, NR_syscalls); + return false; + } + + return test_bit(syscall_nr, thread_data->syscall_ok); +} + +/** + * seccomp_cache_insert - insert into seccomp cache + * @sfilter: The seccomp filter + * @sd: The seccomp data to insert into the cache + */ +static void seccomp_cache_insert(const struct seccomp_filter *sfilter, + const struct seccomp_data *sd) +{ + struct seccomp_cache_task_data *thread_data; + int syscall_nr = sd->nr; + + if (unlikely(syscall_nr < 0 || syscall_nr >= NR_syscalls)) + return; + + thread_data = ¤t->seccomp.cache; + + if (!test_bit(syscall_nr, sfilter->cache.syscall_ok)) + return; + + set_bit(syscall_nr, thread_data->syscall_ok); +} +#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */ + /** * seccomp_run_filters - evaluates all seccomp filters against @sd * @sd: optional seccomp data to be passed to filters @@ -331,13 +395,18 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd, { u32 ret = SECCOMP_RET_ALLOW; /* Make sure cross-thread synced filter points somewhere sane. */ - struct seccomp_filter *f = - READ_ONCE(current->seccomp.filter); + struct seccomp_filter *f, *f_head; + + f = READ_ONCE(current->seccomp.filter); + f_head = f; /* Ensure unexpected behavior doesn't result in failing open. */ if (WARN_ON(f == NULL)) return SECCOMP_RET_KILL_PROCESS; + if (seccomp_cache_check(f_head, sd)) + return SECCOMP_RET_ALLOW; + /* * All filters in the list are evaluated and the lowest BPF return * value always takes priority (ignoring the DATA). 
@@ -350,6 +419,10 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd, *match = f; } } + + if (ret == SECCOMP_RET_ALLOW) + seccomp_cache_insert(f_head, sd); + return ret; } #endif /* CONFIG_SECCOMP_FILTER */ -- 2.28.0 ^ permalink raw reply related [flat|nested] 149+ messages in thread
* Re: [RFC PATCH seccomp 2/2] seccomp/cache: Cache filter results that allow syscalls 2020-09-21 5:35 ` [RFC PATCH seccomp 2/2] seccomp/cache: Cache filter results that allow syscalls YiFei Zhu @ 2020-09-21 18:08 ` Jann Horn 2020-09-21 22:50 ` YiFei Zhu 2020-09-25 0:01 ` [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array Kees Cook 1 sibling, 1 reply; 149+ messages in thread From: Jann Horn @ 2020-09-21 18:08 UTC (permalink / raw) To: YiFei Zhu Cc: Linux Containers, YiFei Zhu, bpf, Andrea Arcangeli, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Valentin Rothberg, Andy Lutomirski, Will Drewry, Aleksa Sarai, kernel list On Mon, Sep 21, 2020 at 7:35 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote: [...] > We do this by creating a per-task bitmap of permitted syscalls. > If seccomp filter is invoked we check if it is cached and if so > directly return allow. Else we call into the cBPF filter, and if > the result is an allow then we cache the results. What? Why? We already have code to statically evaluate the filter for all syscall numbers. We should be using the results of that instead of re-running the filter and separately caching the results. > The cache is per-task Please don't. The static results are per-filter, so the bitmask(s) should also be per-filter and immutable. > minimize thread-synchronization issues in > the hot path of cache lookup There should be no need for synchronization because those bitmasks should be immutable. > and to avoid different architecture > numbers sharing the same cache. There should be separate caches for separate architectures, and we should precompute the results for all architectures. (We only have around 2 different architectures max, so it's completely reasonable to precompute and store all that.) 
> To account for one thread changing the filter for another thread of > the same process, the per-task struct also contains a pointer to > the filter the cache is built on. When the cache lookup uses a > different filter then the last lookup, the per-task cache bitmap is > cleared. Unnecessary complexity, we don't need that if we make the bitmasks immutable. ^ permalink raw reply [flat|nested] 149+ messages in thread
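The alternative being argued for here — an immutable, per-filter allow bitmap — can be sketched as follows. All names are hypothetical and NR_SYSCALLS is shrunk for illustration; the point is that the bitmap is written only while the filter is being attached, so the per-syscall fast path is a read of immutable memory with no clearing and no synchronization.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_SYSCALLS 64	/* illustrative; the real value is per-arch */

/*
 * One allow-bitmap per (filter, architecture), filled in once at
 * attach time by statically evaluating the filter, then treated as
 * read-only forever after.
 */
struct filter_cache {
	uint64_t allow[NR_SYSCALLS / 64];	/* 1 bit per syscall */
};

/* Attach-time path: mark a syscall as argument-independent ALLOW. */
void cache_mark_allowed(struct filter_cache *c, int nr)
{
	c->allow[nr / 64] |= 1ull << (nr % 64);
}

/* Fast path: out-of-range numbers simply fall back to the filter. */
bool cache_allows(const struct filter_cache *c, int nr)
{
	if (nr < 0 || nr >= NR_SYSCALLS)
		return false;
	return (c->allow[nr / 64] >> (nr % 64)) & 1;
}
```

Because nothing ever writes the bitmap after attach, there is no last-filter pointer to compare, no per-task state to invalidate, and nothing for another thread to race against.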
* Re: [RFC PATCH seccomp 2/2] seccomp/cache: Cache filter results that allow syscalls 2020-09-21 18:08 ` Jann Horn @ 2020-09-21 22:50 ` YiFei Zhu 2020-09-21 22:57 ` Jann Horn 0 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-09-21 22:50 UTC (permalink / raw) To: Jann Horn Cc: Linux Containers, YiFei Zhu, bpf, Andrea Arcangeli, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Valentin Rothberg, Andy Lutomirski, Will Drewry, Aleksa Sarai, kernel list On Mon, Sep 21, 2020 at 1:09 PM Jann Horn <jannh@google.com> wrote: > > On Mon, Sep 21, 2020 at 7:35 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > [...] > > We do this by creating a per-task bitmap of permitted syscalls. > > If seccomp filter is invoked we check if it is cached and if so > > directly return allow. Else we call into the cBPF filter, and if > > the result is an allow then we cache the results. > > What? Why? We already have code to statically evaluate the filter for > all syscall numbers. We should be using the results of that instead of > re-running the filter and separately caching the results. > > > The cache is per-task > > Please don't. The static results are per-filter, so the bitmask(s) > should also be per-filter and immutable. I do agree that an immutable bitmask is faster and easier to reason about its correctness. However, I did not find the "code to statically evaluate the filter for all syscall numbers" while reading seccomp.c. Would you give me a pointer to that and I will see how to best make use of it? YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [RFC PATCH seccomp 2/2] seccomp/cache: Cache filter results that allow syscalls 2020-09-21 22:50 ` YiFei Zhu @ 2020-09-21 22:57 ` Jann Horn 2020-09-21 23:08 ` YiFei Zhu 0 siblings, 1 reply; 149+ messages in thread From: Jann Horn @ 2020-09-21 22:57 UTC (permalink / raw) To: YiFei Zhu Cc: Linux Containers, YiFei Zhu, bpf, Andrea Arcangeli, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Valentin Rothberg, Andy Lutomirski, Will Drewry, Aleksa Sarai, kernel list On Tue, Sep 22, 2020 at 12:51 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > On Mon, Sep 21, 2020 at 1:09 PM Jann Horn <jannh@google.com> wrote: > > > > On Mon, Sep 21, 2020 at 7:35 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > > [...] > > > We do this by creating a per-task bitmap of permitted syscalls. > > > If seccomp filter is invoked we check if it is cached and if so > > > directly return allow. Else we call into the cBPF filter, and if > > > the result is an allow then we cache the results. > > > > What? Why? We already have code to statically evaluate the filter for > > all syscall numbers. We should be using the results of that instead of > > re-running the filter and separately caching the results. > > > > > The cache is per-task > > > > Please don't. The static results are per-filter, so the bitmask(s) > > should also be per-filter and immutable. > > I do agree that an immutable bitmask is faster and easier to reason > about its correctness. However, I did not find the "code to statically > evaluate the filter for all syscall numbers" while reading seccomp.c. > Would you give me a pointer to that and I will see how to best make > use of it? I'm talking about the code you're adding in the other patch ("[RFC PATCH seccomp 1/2] seccomp/cache: Add "emulator" to check if filter is arg-dependent"). Sorry, that was a bit unclear. ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [RFC PATCH seccomp 2/2] seccomp/cache: Cache filter results that allow syscalls 2020-09-21 22:57 ` Jann Horn @ 2020-09-21 23:08 ` YiFei Zhu 0 siblings, 0 replies; 149+ messages in thread From: YiFei Zhu @ 2020-09-21 23:08 UTC (permalink / raw) To: Jann Horn Cc: Linux Containers, YiFei Zhu, bpf, Andrea Arcangeli, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Valentin Rothberg, Andy Lutomirski, Will Drewry, Aleksa Sarai, kernel list On Mon, Sep 21, 2020 at 5:58 PM Jann Horn <jannh@google.com> wrote: > > I do agree that an immutable bitmask is faster and easier to reason > > about its correctness. However, I did not find the "code to statically > > evaluate the filter for all syscall numbers" while reading seccomp.c. > > Would you give me a pointer to that and I will see how to best make > > use of it? > > I'm talking about the code you're adding in the other patch ("[RFC > PATCH seccomp 1/2] seccomp/cache: Add "emulator" to check if filter is > arg-dependent"). Sorry, that was a bit unclear. I see, building an immutable accept bitmask when preparing and then just use that when running it. I guess if the arch number issue is resolved this should be more doable. Will do. YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array 2020-09-21 5:35 ` [RFC PATCH seccomp 2/2] seccomp/cache: Cache filter results that allow syscalls YiFei Zhu 2020-09-21 18:08 ` Jann Horn @ 2020-09-25 0:01 ` Kees Cook 2020-09-25 0:15 ` Jann Horn 2020-09-25 1:27 ` YiFei Zhu 1 sibling, 2 replies; 149+ messages in thread From: Kees Cook @ 2020-09-25 0:01 UTC (permalink / raw) To: YiFei Zhu Cc: YiFei Zhu, containers, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry [resend, argh, I didn't reply-all, sorry for the noise] On Thu, Sep 24, 2020 at 07:44:17AM -0500, YiFei Zhu wrote: > From: YiFei Zhu <yifeifz2@illinois.edu> > > Seccomp cache emulator needs to know all the architecture numbers > that syscall_get_arch() could return for the kernel build in order > to generate a cache for all of them. > > The array is declared in header as static __maybe_unused const > to maximize compiler optimization opportunities such as loop > unrolling. Disregarding the "how" of this, yeah, we'll certainly need something to tell seccomp about the arrangement of syscall tables and how to find them. However, I'd still prefer to do this on a per-arch basis, and include more detail, as I've got in my v1. Something missing from both styles, though, is a consolidation of values, where the AUDIT_ARCH* isn't reused in both the seccomp info and the syscall_get_arch() return. The problems here were two-fold: 1) putting this in syscall.h meant you do not have full NR_syscall* visibility on some architectures (e.g. arm64 plays weird games with header include order). 
2) seccomp needs to handle "multiplexed" tables like x86_x32 (distros haven't removed CONFIG_X86_X32 widely yet, so it is a reality that it must be dealt with), which means seccomp's idea of the arch "number" can't be the same as the AUDIT_ARCH. So, likely a combo of approaches is needed: an array (or more likely, enum), declared in the per-arch seccomp.h file. And I don't see a way to solve #1 cleanly. Regardless, it needs to be split per architecture so that regressions can be bisected/reverted/isolated cleanly. And if we can't actually test it at runtime (or find someone who can) it's not a good idea to make the change. :) > [...] > diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h > index 7cbf733d11af..e13bb2a65b6f 100644 > --- a/arch/x86/include/asm/syscall.h > +++ b/arch/x86/include/asm/syscall.h > @@ -97,6 +97,10 @@ static inline void syscall_set_arguments(struct task_struct *task, > memcpy(®s->bx + i, args, n * sizeof(args[0])); > } > > +static __maybe_unused const int syscall_arches[] = { > + AUDIT_ARCH_I386 > +}; > + > static inline int syscall_get_arch(struct task_struct *task) > { > return AUDIT_ARCH_I386; > @@ -152,6 +156,13 @@ static inline void syscall_set_arguments(struct task_struct *task, > } > } > > +static __maybe_unused const int syscall_arches[] = { > + AUDIT_ARCH_X86_64, > +#ifdef CONFIG_IA32_EMULATION > + AUDIT_ARCH_I386, > +#endif > +}; I'm leaving this section quoted because I'll refer to it in a later patch review... -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array 2020-09-25 0:01 ` [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array Kees Cook @ 2020-09-25 0:15 ` Jann Horn 2020-09-25 0:18 ` Al Viro 2020-09-25 1:27 ` YiFei Zhu 1 sibling, 1 reply; 149+ messages in thread From: Jann Horn @ 2020-09-25 0:15 UTC (permalink / raw) To: Kees Cook Cc: YiFei Zhu, YiFei Zhu, Linux Containers, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Fri, Sep 25, 2020 at 2:01 AM Kees Cook <keescook@chromium.org> wrote: > 2) seccomp needs to handle "multiplexed" tables like x86_x32 (distros > haven't removed CONFIG_X86_X32 widely yet, so it is a reality that > it must be dealt with), which means seccomp's idea of the arch > "number" can't be the same as the AUDIT_ARCH. Sure, distros ship it; but basically nobody uses it, it doesn't have to be fast. As long as we don't *break* it, everything's fine. And if we ignore the existence of X32 in the fastpath, that'll just mean that syscalls with the X32 marker bit always hit the seccomp slowpath (because it'll look like the syscall number is out-of-bounds ) - no problem. ^ permalink raw reply [flat|nested] 149+ messages in thread
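The range-check argument above can be made concrete. __X32_SYSCALL_BIT is the real marker bit from the x86 uapi headers; NR_SYSCALLS is an illustrative stand-in for the generated native table size.

```c
#include <assert.h>
#include <stdbool.h>

#define __X32_SYSCALL_BIT 0x40000000	/* arch/x86/include/uapi/asm/unistd.h */
#define NR_SYSCALLS 450			/* illustrative stand-in */

/*
 * The bounds check guarding any native-table bitmap: an x32 syscall
 * number carries the marker bit, so it is far out of range, always
 * fails this check, and falls through to the ordinary (slow) filter
 * evaluation -- x32 keeps working, it just isn't accelerated.
 */
bool in_bitmap_range(long nr)
{
	return nr >= 0 && nr < NR_SYSCALLS;
}
```

So ignoring x32 in the fast path is safe by construction: the worst case is that an x32 syscall number is never looked up in the cache at all.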
* Re: [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array 2020-09-25 0:15 ` Jann Horn @ 2020-09-25 0:18 ` Al Viro 2020-09-25 0:24 ` Jann Horn 0 siblings, 1 reply; 149+ messages in thread From: Al Viro @ 2020-09-25 0:18 UTC (permalink / raw) To: Jann Horn Cc: Kees Cook, YiFei Zhu, YiFei Zhu, Linux Containers, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Fri, Sep 25, 2020 at 02:15:50AM +0200, Jann Horn wrote: > On Fri, Sep 25, 2020 at 2:01 AM Kees Cook <keescook@chromium.org> wrote: > > 2) seccomp needs to handle "multiplexed" tables like x86_x32 (distros > > haven't removed CONFIG_X86_X32 widely yet, so it is a reality that > > it must be dealt with), which means seccomp's idea of the arch > > "number" can't be the same as the AUDIT_ARCH. > > Sure, distros ship it; but basically nobody uses it, it doesn't have > to be fast. As long as we don't *break* it, everything's fine. And if > we ignore the existence of X32 in the fastpath, that'll just mean that > syscalls with the X32 marker bit always hit the seccomp slowpath > (because it'll look like the syscall number is out-of-bounds ) - no > problem. You do realize that X32 is amd64 counterpart of mips n32, right? And that's not "basically nobody uses it"... ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array 2020-09-25 0:18 ` Al Viro @ 2020-09-25 0:24 ` Jann Horn 0 siblings, 0 replies; 149+ messages in thread From: Jann Horn @ 2020-09-25 0:24 UTC (permalink / raw) To: Al Viro Cc: Kees Cook, YiFei Zhu, YiFei Zhu, Linux Containers, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Fri, Sep 25, 2020 at 2:18 AM Al Viro <viro@zeniv.linux.org.uk> wrote: > On Fri, Sep 25, 2020 at 02:15:50AM +0200, Jann Horn wrote: > > On Fri, Sep 25, 2020 at 2:01 AM Kees Cook <keescook@chromium.org> wrote: > > > 2) seccomp needs to handle "multiplexed" tables like x86_x32 (distros > > > haven't removed CONFIG_X86_X32 widely yet, so it is a reality that > > > it must be dealt with), which means seccomp's idea of the arch > > > "number" can't be the same as the AUDIT_ARCH. > > > > Sure, distros ship it; but basically nobody uses it, it doesn't have > > to be fast. As long as we don't *break* it, everything's fine. And if > > we ignore the existence of X32 in the fastpath, that'll just mean that > > syscalls with the X32 marker bit always hit the seccomp slowpath > > (because it'll look like the syscall number is out-of-bounds ) - no > > problem. > > You do realize that X32 is amd64 counterpart of mips n32, right? And that's > not "basically nobody uses it"... What makes X32 weird for seccomp is that it has the syscall tables for X86-64 and X32 mushed together, using the single architecture identifier AUDIT_ARCH_X86_64. I believe that's what Kees referred to by "multiplexed tables". As far as I can tell, MIPS is more well-behaved there and uses the separate architecture identifiers AUDIT_ARCH_MIPS|__AUDIT_ARCH_64BIT and AUDIT_ARCH_MIPS|__AUDIT_ARCH_64BIT|__AUDIT_ARCH_CONVENTION_MIPS64_N32. 
(But no, I did not actually realize that that's what N32 is. Thanks for the explanation, I was wondering why MIPS was the only architecture with three architecture identifiers...) ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array 2020-09-25 0:01 ` [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array Kees Cook 2020-09-25 0:15 ` Jann Horn @ 2020-09-25 1:27 ` YiFei Zhu 2020-09-25 3:09 ` Kees Cook 1 sibling, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-09-25 1:27 UTC (permalink / raw) To: Kees Cook Cc: YiFei Zhu, Linux Containers, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry [resending this too] On Thu, Sep 24, 2020 at 6:01 PM Kees Cook <keescook@chromium.org> wrote: > Disregarding the "how" of this, yeah, we'll certainly need something to > tell seccomp about the arrangement of syscall tables and how to find > them. > > However, I'd still prefer to do this on a per-arch basis, and include > more detail, as I've got in my v1. > > Something missing from both styles, though, is a consolidation of > values, where the AUDIT_ARCH* isn't reused in both the seccomp info and > the syscall_get_arch() return. The problems here were two-fold: > > 1) putting this in syscall.h meant you do not have full NR_syscall* > visibility on some architectures (e.g. arm64 plays weird games with > header include order). I don't get this one -- I'm not playing with NR_syscall here. > 2) seccomp needs to handle "multiplexed" tables like x86_x32 (distros > haven't removed CONFIG_X86_X32 widely yet, so it is a reality that > it must be dealt with), which means seccomp's idea of the arch > "number" can't be the same as the AUDIT_ARCH. Why so? Does anyone actually use x32 in a container? The memory cost and analysis cost is on everyone. The worst case scenario if we don't support it is that the syscall is not accelerated. 
> So, likely a combo of approaches is needed: an array (or more likely, > enum), declared in the per-arch seccomp.h file. And I don't see a way > to solve #1 cleanly. > > Regardless, it needs to be split per architecture so that regressions > can be bisected/reverted/isolated cleanly. And if we can't actually test > it at runtime (or find someone who can) it's not a good idea to make the > change. :) You have a good point regarding tests. Don't see how it affects regressions though. Only one file here is ever included per-build. > > [...] > > diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h > > index 7cbf733d11af..e13bb2a65b6f 100644 > > --- a/arch/x86/include/asm/syscall.h > > +++ b/arch/x86/include/asm/syscall.h > > @@ -97,6 +97,10 @@ static inline void syscall_set_arguments(struct task_struct *task, > > memcpy(®s->bx + i, args, n * sizeof(args[0])); > > } > > > > +static __maybe_unused const int syscall_arches[] = { > > + AUDIT_ARCH_I386 > > +}; > > + > > static inline int syscall_get_arch(struct task_struct *task) > > { > > return AUDIT_ARCH_I386; > > @@ -152,6 +156,13 @@ static inline void syscall_set_arguments(struct task_struct *task, > > } > > } > > > > +static __maybe_unused const int syscall_arches[] = { > > + AUDIT_ARCH_X86_64, > > +#ifdef CONFIG_IA32_EMULATION > > + AUDIT_ARCH_I386, > > +#endif > > +}; > > I'm leaving this section quoted because I'll refer to it in a later > patch review... > > -- > Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
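A rough userspace sketch of how the patch's proposed syscall_arches[] array might be consumed (the AUDIT_ARCH_* constants match uapi/linux/audit.h, but the lookup helper, its name, and the use of uint32_t are invented for illustration and are not part of the patch):

```c
#include <stdint.h>
#include <stddef.h>

/* Mock of the proposed per-arch array; the real definitions live in
 * each architecture's asm/syscall.h.  CONFIG_IA32_EMULATION is assumed
 * enabled here, so both arches are listed. */
#define AUDIT_ARCH_X86_64 0xc000003eu
#define AUDIT_ARCH_I386   0x40000003u

static const uint32_t syscall_arches[] = {
	AUDIT_ARCH_X86_64,
	AUDIT_ARCH_I386,
};

/* Hypothetical helper: map an AUDIT_ARCH_* value to its index in the
 * array (e.g. to select a per-arch bitmap); -1 for unknown arches. */
static int seccomp_arch_index(uint32_t arch)
{
	for (size_t i = 0; i < sizeof(syscall_arches) / sizeof(syscall_arches[0]); i++)
		if (syscall_arches[i] == arch)
			return (int)i;
	return -1;
}
```

Unknown arches (for example x32, if it is left unsupported) simply fall through to -1 and stay unaccelerated, which matches the worst-case behavior described above.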
* Re: [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array 2020-09-25 1:27 ` YiFei Zhu @ 2020-09-25 3:09 ` Kees Cook 2020-09-25 3:28 ` YiFei Zhu 0 siblings, 1 reply; 149+ messages in thread From: Kees Cook @ 2020-09-25 3:09 UTC (permalink / raw) To: YiFei Zhu Cc: YiFei Zhu, Linux Containers, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Thu, Sep 24, 2020 at 08:27:40PM -0500, YiFei Zhu wrote: > [resending this too] > > On Thu, Sep 24, 2020 at 6:01 PM Kees Cook <keescook@chromium.org> wrote: > > Disregarding the "how" of this, yeah, we'll certainly need something to > > tell seccomp about the arrangement of syscall tables and how to find > > them. > > > > However, I'd still prefer to do this on a per-arch basis, and include > > more detail, as I've got in my v1. > > > > Something missing from both styles, though, is a consolidation of > > values, where the AUDIT_ARCH* isn't reused in both the seccomp info and > > the syscall_get_arch() return. The problems here were two-fold: > > > > 1) putting this in syscall.h meant you do not have full NR_syscall* > > visibility on some architectures (e.g. arm64 plays weird games with > > header include order). > > I don't get this one -- I'm not playing with NR_syscall here. Right, sorry, I may not have been clear. When building my RFC I noticed that I couldn't use NR_syscall very "early" in the header file include stack on arm64, which complicated things. So I guess what I mean is something like "it's probably better to do all these seccomp-specific macros/etc in asm/include/seccomp.h rather than in syscall.h because I know at least one architecture that might cause trouble." 
> > 2) seccomp needs to handle "multiplexed" tables like x86_x32 (distros > > haven't removed CONFIG_X86_X32 widely yet, so it is a reality that > > it must be dealt with), which means seccomp's idea of the arch > > "number" can't be the same as the AUDIT_ARCH. > > Why so? Does anyone actually use x32 in a container? The memory cost > and analysis cost is on everyone. The worst case scenario if we don't > support it is that the syscall is not accelerated. Ironically, that's the only place I actually know for sure where people are using x32, because it shows measurable (10%) speed-up for builders: https://lore.kernel.org/lkml/CAOesGMgu1i3p7XMZuCEtj63T-ST_jh+BfaHy-K6LhgqNriKHAA@mail.gmail.com So, yes, as you and Jann both point out, it wouldn't be terrible to just ignore x32, but it seems a shame to penalize it. That said, if the masking step from my v1 is actually noticeable on a native workload, then yeah, probably x32 should be ignored. My instinct (not measured) is that it's faster than walking a small array.[citation needed] > > So, likely a combo of approaches is needed: an array (or more likely, > enum), declared in the per-arch seccomp.h file. And I don't see a way > to solve #1 cleanly. > > > > Regardless, it needs to be split per architecture so that regressions > > can be bisected/reverted/isolated cleanly. And if we can't actually test > > it at runtime (or find someone who can) it's not a good idea to make the > > change. :) > > You have a good point regarding tests. Don't see how it affects > regressions though. Only one file here is ever included per-build. It's easier to do a per-arch revert (i.e. all the -stable tree machinery, etc) with a single SHA instead of having to write a partial revert, etc. -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array 2020-09-25 3:09 ` Kees Cook @ 2020-09-25 3:28 ` YiFei Zhu 2020-09-25 16:39 ` YiFei Zhu 0 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-09-25 3:28 UTC (permalink / raw) To: Kees Cook Cc: YiFei Zhu, Linux Containers, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Thu, Sep 24, 2020 at 10:09 PM Kees Cook <keescook@chromium.org> wrote: > Right, sorry, I may not have been clear. When building my RFC I noticed > that I couldn't use NR_syscall very "early" in the header file include > stack on arm64, which complicated things. So I guess what I mean is > something like "it's probably better to do all these seccomp-specific > macros/etc in asm/include/seccomp.h rather than in syscall.h because I > know at least one architecture that might cause trouble." Ah. Makes sense. > Ironicailly, that's the only place I actually know for sure where people > using x32 because it shows measurable (10%) speed-up for builders: > https://lore.kernel.org/lkml/CAOesGMgu1i3p7XMZuCEtj63T-ST_jh+BfaHy-K6LhgqNriKHAA@mail.gmail.com Wow. 10% is significant. Makes you wonder why x32 hasn't conquered the world. > So, yes, as you and Jann both point out, it wouldn't be terrible to just > ignore x32, it seems a shame to penalize it. That said, if the masking > step from my v1 is actually noticable on a native workload, then yeah, > probably x32 should be ignored. My instinct (not measured) is that it's > faster than walking a small array.[citation needed] My instinct: should be pretty similar, with the loop unrolled. You've convinced me that penalizing x32 by not supporting it would be a pity :( The 10% is so nice I want it. > It's easier to do a per-arch revert (i.e.
all the -stable tree > machinery, etc) with a single SHA instead of having to write a partial > revert, etc. I see. Thanks for clarifying. How about this? Rather than specifically designing names for bitmasks (native, compat, multiplex), just have SECCOMP_ARCH_{1,2,3}? Each arch number would provide the size of the bitmap and a static inline function to check whether the given seccomp_data belongs to that arch and, if so, the index of its bit in the bitmap. There is no need for the shifts and madness in seccomp.c; it's arch-dependent code in each arch's own seccomp.h. We let the preprocessor and compiler optimize things. YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
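The SECCOMP_ARCH_{1,2,3} idea above can be sketched as plain userspace C (arch numbers are the real AUDIT_ARCH_* values, but the syscall counts and the function name are invented for illustration): each arch contributes a table size and an inline mapper, and the generic code only ever deals in flat bit indices.

```c
#include <stdint.h>

#define AUDIT_ARCH_X86_64 0xc000003eu
#define AUDIT_ARCH_I386   0x40000003u

/* Per-arch bitmap sizes; a real arch would set these to its syscall
 * table sizes.  SECCOMP_ARCH_1 is native, SECCOMP_ARCH_2 is compat.
 * The counts below are illustrative, not the actual table sizes. */
#define SECCOMP_ARCH_1_NR  450
#define SECCOMP_ARCH_2_NR  440
#define SECCOMP_CACHE_BITS (SECCOMP_ARCH_1_NR + SECCOMP_ARCH_2_NR)

/* Would live in the arch's seccomp.h: map (arch, nr) to a bit index in
 * one flat bitmap, or -1 if this (arch, nr) is not cached at all
 * (out of range, or a deliberately unsupported arch such as x32). */
static inline int seccomp_cache_bit(uint32_t arch, int nr)
{
	switch (arch) {
	case AUDIT_ARCH_X86_64:
		return (nr >= 0 && nr < SECCOMP_ARCH_1_NR) ? nr : -1;
	case AUDIT_ARCH_I386:
		return (nr >= 0 && nr < SECCOMP_ARCH_2_NR)
			? SECCOMP_ARCH_1_NR + nr : -1;
	default:
		return -1;
	}
}
```

Because the switch is over compile-time constants, the compiler can fold this down to a couple of compares, which is the "let the preprocessor and compiler optimize things" point made above.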
* Re: [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array 2020-09-25 3:28 ` YiFei Zhu @ 2020-09-25 16:39 ` YiFei Zhu 0 siblings, 0 replies; 149+ messages in thread From: YiFei Zhu @ 2020-09-25 16:39 UTC (permalink / raw) To: Kees Cook Cc: YiFei Zhu, Linux Containers, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Thu, Sep 24, 2020 at 10:28 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > Ah. Makes sense. > > > Ironicailly, that's the only place I actually know for sure where people > > using x32 because it shows measurable (10%) speed-up for builders: > > https://lore.kernel.org/lkml/CAOesGMgu1i3p7XMZuCEtj63T-ST_jh+BfaHy-K6LhgqNriKHAA@mail.gmail.com > > Wow. 10% is significant. Makes you wonder why x32 hasn't conquered the world. > > > So, yes, as you and Jann both point out, it wouldn't be terrible to just > > ignore x32, it seems a shame to penalize it. That said, if the masking > > step from my v1 is actually noticable on a native workload, then yeah, > > probably x32 should be ignored. My instinct (not measured) is that it's > > faster than walking a small array.[citation needed] > > You convince me that penalizing supporting x32 would be a pity :( The > 10% is so nice I want it. I'm rethinking this -- the majority of our users will not use x32. I don't think it's that useful for the majority to run all the simulations and have the memory footprint if only a small minority will use it. I also just checked Debian, and it has boot-time disabling of the x32 arch downstream [1]: CONFIG_X86_X32=y CONFIG_X86_X32_DISABLED=y Which means we will still generate all the code for x32 in seccomp even though people probably won't be using it... 
I also talked to some of my peers, and they had a point about how x32's 4GiB address-space limit is very harsh on many modern language runtimes, so even though it provides a 10% speed boost, its adoption is hard -- one has to compile all the C libraries for x32 in addition to x86_64, since any program needing more than 4GiB of address space still requires the x86_64 versions of the libraries. [1] https://wiki.debian.org/X32Port YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls 2020-09-21 5:35 [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls YiFei Zhu 2020-09-21 5:35 ` [RFC PATCH seccomp 1/2] seccomp/cache: Add "emulator" to check if filter is arg-dependent YiFei Zhu 2020-09-21 5:35 ` [RFC PATCH seccomp 2/2] seccomp/cache: Cache filter results that allow syscalls YiFei Zhu @ 2020-09-21 5:48 ` Sargun Dhillon 2020-09-21 7:13 ` YiFei Zhu 2020-09-21 8:30 ` Christian Brauner ` (5 subsequent siblings) 8 siblings, 1 reply; 149+ messages in thread From: Sargun Dhillon @ 2020-09-21 5:48 UTC (permalink / raw) To: YiFei Zhu Cc: Linux Containers, Andrea Arcangeli, Giuseppe Scrivano, Kees Cook, YiFei Zhu, Tobin Feldman-Fitzthum, Dimitrios Skarlatos, Valentin Rothberg, Hubertus Franke, Jack Chen, Josep Torrellas, bpf, Tianyin Xu On Sun, Sep 20, 2020 at 10:35 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > > From: YiFei Zhu <yifeifz2@illinois.edu> > > This series adds a bitmap to cache seccomp filter results if the > result permits a syscall and is indepenent of syscall arguments. > This visibly decreases seccomp overhead for most common seccomp > filters with very little memory footprint. > > The overhead of running Seccomp filters has been part of some past > discussions [1][2][3]. Oftentimes, the filters have a large number > of instructions that check syscall numbers one by one and jump based > on that. Some users chain BPF filters which further enlarge the > overhead. A recent work [6] comprehensively measures the Seccomp > overhead and shows that the overhead is non-negligible and has a > non-trivial impact on application performance. > > We propose SECCOMP_CACHE, a cache-based solution to minimize the > Seccomp overhead. The basic idea is to cache the result of each > syscall check to save the subsequent overhead of executing the > filters. 
This is feasible, because the check in Seccomp is stateless. > The checking results of the same syscall ID and argument remains > the same. > > We observed some common filters, such as docker's [4] or > systemd's [5], will make most decisions based only on the syscall > numbers, and as past discussions considered, a bitmap where each bit > represents a syscall makes most sense for these filters. > > In the past Kees proposed [2] to have an "add this syscall to the > reject bitmask". It is indeed much easier to securely make a reject > accelerator to pre-filter syscalls before passing to the BPF > filters, considering it could only strengthen the security provided > by the filter. However, ultimately, filter rejections are an > exceptional / rare case. Here, instead of accelerating what is > rejected, we accelerate what is allowed. In order not to compromise > the security rules the BPF filters defined, any accept-side > accelerator must complement the BPF filters rather than replacing them. > > Statically analyzing BPF bytecode to see if each syscall is going to > always land in allow or reject is more of a rabbit hole, especially > there is no current in-kernel infrastructure to enumerate all the > possible architecture numbers for a given machine. So rather than > doing that, we propose to cache the results after the BPF filters are > run. And since there are filters like docker's who will check > arguments of some syscalls, but not all or none of the syscalls, when > a filter is loaded we analyze it to find whether each syscall is > cacheable (does not access syscall argument or instruction pointer) by > following its control flow graph, and store the result for each filter > in a bitmap. Changes to architecture number or the filter are expected > to be rare and simply cause the cache to be cleared. This solution > shall be fully transparent to userspace. Long-term, do you believe static analysis will be viable? 
I think that it is the "ideal" solution here, but I agree in that it is more complex. Is there a way to "prime" filters, by giving them a syscall #, and if it has a terminal condition without inspecting args, it turns into a bitmask entry viable? > > Ongoing work is to further support arguments with fast hash table > lookups. We are investigating the performance of doing so [6], and how > to best integrate with the existing seccomp infrastructure. > > We have done some benchmarks with patch applied against bpf-next > commit 2e80be60c465 ("libbpf: Fix compilation warnings for 64-bit printf args"). > > Me, in qemu-kvm x86_64 VM, on Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz, > average results: > > Without cache, seccomp_benchmark: > Current BPF sysctl settings: > net.core.bpf_jit_enable = 1 > net.core.bpf_jit_harden = 0 > Calibrating sample size for 15 seconds worth of syscalls ... > Benchmarking 23486415 syscalls... > 16.079642020 - 1.013345439 = 15066296581 (15.1s) > getpid native: 641 ns > 32.080237410 - 16.080763500 = 15999473910 (16.0s) > getpid RET_ALLOW 1 filter: 681 ns > 48.609461618 - 32.081296173 = 16528165445 (16.5s) > getpid RET_ALLOW 2 filters: 703 ns > Estimated total seccomp overhead for 1 filter: 40 ns > Estimated total seccomp overhead for 2 filters: 62 ns > Estimated seccomp per-filter overhead: 22 ns > Estimated seccomp entry overhead: 18 ns > > With cache: > Current BPF sysctl settings: > net.core.bpf_jit_enable = 1 > net.core.bpf_jit_harden = 0 > Calibrating sample size for 15 seconds worth of syscalls ... > Benchmarking 23486415 syscalls... 
> 16.059512499 - 1.014108434 = 15045404065 (15.0s) > getpid native: 640 ns > 31.651075934 - 16.060637323 = 15590438611 (15.6s) > getpid RET_ALLOW 1 filter: 663 ns > 47.367316169 - 31.652302661 = 15715013508 (15.7s) > getpid RET_ALLOW 2 filters: 669 ns > Estimated total seccomp overhead for 1 filter: 23 ns > Estimated total seccomp overhead for 2 filters: 29 ns > Estimated seccomp per-filter overhead: 6 ns > Estimated seccomp entry overhead: 17 ns > > Depending on the run estimated seccomp overhead for 2 filters can be > less than seccomp overhead for 1 filter, resulting in underflow to > estimated seccomp per-filter overhead: > Estimated total seccomp overhead for 1 filter: 27 ns > Estimated total seccomp overhead for 2 filters: 21 ns > Estimated seccomp per-filter overhead: 18446744073709551610 ns > Estimated seccomp entry overhead: 33 ns > > Jack Chen has also run some benchmarks on a bare metal > Intel(R) Xeon(R) CPU E3-1240 v3 @ 3.40GHz, with side channel > mitigations off (spec_store_bypass_disable=off spectre_v2=off mds=off > pti=off l1tf=off), with BPF JIT on and docker default profile, > and reported: > > unixbench syscall mix (https://github.com/kdlucas/byte-unixbench) > unconfined: 33295685 > docker default: 20661056 60% > docker default + cache: 25719937 30% > > Patch 1 introduces the static analyzer to check for a given filter, > whether the CFG loads the syscall arguments for each syscall number. > > Patch 2 implements the bitmap cache. 
> > [1] https://lore.kernel.org/linux-security-module/c22a6c3cefc2412cad00ae14c1371711@huawei.com/T/ > [2] https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/T/ > [3] https://github.com/seccomp/libseccomp/issues/116 > [4] https://github.com/moby/moby/blob/ae0ef82b90356ac613f329a8ef5ee42ca923417d/profiles/seccomp/default.json > [5] https://github.com/systemd/systemd/blob/6743a1caf4037f03dc51a1277855018e4ab61957/src/shared/seccomp-util.c#L270 > [6] Draco: Architectural and Operating System Support for System Call Security > https://tianyin.github.io/pub/draco.pdf, MICRO-53, Oct. 2020 > > YiFei Zhu (2): > seccomp/cache: Add "emulator" to check if filter is arg-dependent > seccomp/cache: Cache filter results that allow syscalls > > arch/x86/Kconfig | 27 +++ > include/linux/seccomp.h | 22 +++ > kernel/seccomp.c | 400 +++++++++++++++++++++++++++++++++++++++- > 3 files changed, 446 insertions(+), 3 deletions(-) > > -- > 2.28.0 > _______________________________________________ > Containers mailing list > Containers@lists.linux-foundation.org > https://lists.linuxfoundation.org/mailman/listinfo/containers ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls 2020-09-21 5:48 ` [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls Sargun Dhillon @ 2020-09-21 7:13 ` YiFei Zhu 0 siblings, 0 replies; 149+ messages in thread From: YiFei Zhu @ 2020-09-21 7:13 UTC (permalink / raw) To: Sargun Dhillon Cc: Linux Containers, Andrea Arcangeli, Giuseppe Scrivano, Kees Cook, YiFei Zhu, Tobin Feldman-Fitzthum, Dimitrios Skarlatos, Valentin Rothberg, Hubertus Franke, Jack Chen, Josep Torrellas, bpf, Tianyin Xu On Mon, Sep 21, 2020 at 12:49 AM Sargun Dhillon <sargun@sargun.me> wrote: > > On Sun, Sep 20, 2020 at 10:35 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > > > Long-term, do you believe static analysis will be viable? I think that it is > the "ideal" solution here, but I agree in that it is more complex. > > Is there a way to "prime" filters, by giving them a syscall #, and if it has > a terminal condition without inspecting args, it turns into a bitmask entry > viable? I think in theory one could follow the execution of the filter, and if the filter is determined to return a pass for a given syscall number under all circumstances, we record that syscall. We can then replace the bitmap_zero call in seccomp_cache_check with a call to bitmap_copy from the pre-primed bitmap. However, I don't know how much benefit this would provide. One ugly part of the current situation is that the kernel has absolutely no idea what arch numbers returned by syscall_get_arch may be possible for the machine it is running on. For example, for an x86_64 machine with IA32 emulation, the arch number can be either AUDIT_ARCH_I386 or AUDIT_ARCH_X86_64. The seccomp filter will typically have parts handling both cases. As a result, an uncertainty for one syscall on one arch will affect the syscall under the same number for the other arch. 
If a syscall number is not guaranteed to be allowed under both arches, it won't be primed. Given that a seccomp filter is usually a list of allowed syscalls, my guess is that there won't be many syscall numbers that fall under this case; though, I have not tested this. We could add an array of possible arch numbers so that the emulator can refine its tracing. This is probably the best-effort approach, though seccomp_cache_prepare would then have to iterate through all combinations of syscall numbers and arch numbers. Given that seccomp_cache_prepare should be relatively cold, that's probably not too much trouble. Alternatively, we could employ constraint tracking, but that sounds overly complex for what we are trying to do. The other question would be: is pre-priming the cache worth the effort? The assumption is that the vast majority of cacheable syscalls will be permitted. For them, only the first invocation of a particular syscall experiences the overhead of running the filter, so the part of the initial run that pre-priming would optimize away is relatively cold anyway. wdyt? YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
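The prepare-at-load / check-at-entry split discussed in this subthread can be sketched in miniature (userspace C; the emulator is stubbed out, and all names, sizes, and the "arg-dependent" syscalls are illustrative, not the real kernel code):

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NR_SYSCALLS 64	/* small illustrative table */

struct filter_cache {
	uint64_t allow;	/* bit n set => syscall n always allowed, arg-independent */
};

/* Placeholder for the in-kernel emulator: returns true when the
 * filter provably allows syscall `nr` without reading its arguments.
 * Here we just pretend syscalls 2 and 3 are arg-dependent. */
static bool filter_always_allows(int nr)
{
	return nr != 2 && nr != 3;
}

/* Runs once at filter-load time (the "relatively cold" path). */
static void cache_prepare(struct filter_cache *c)
{
	memset(c, 0, sizeof(*c));
	for (int nr = 0; nr < NR_SYSCALLS; nr++)
		if (filter_always_allows(nr))
			c->allow |= 1ULL << nr;
}

/* Hot path: one bit test instead of executing the BPF program.
 * A miss means "fall back to running the filter", not "deny". */
static bool cache_check(const struct filter_cache *c, int nr)
{
	return nr >= 0 && nr < NR_SYSCALLS && ((c->allow >> nr) & 1);
}
```

This also shows why a cache miss is safe: the bitmap can only short-circuit to "allow"; anything not provably cacheable still goes through the full filter.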
* Re: [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls 2020-09-21 5:35 [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls YiFei Zhu ` (2 preceding siblings ...) 2020-09-21 5:48 ` [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls Sargun Dhillon @ 2020-09-21 8:30 ` Christian Brauner 2020-09-21 8:44 ` YiFei Zhu 2020-09-21 13:51 ` Tycho Andersen ` (4 subsequent siblings) 8 siblings, 1 reply; 149+ messages in thread From: Christian Brauner @ 2020-09-21 8:30 UTC (permalink / raw) To: YiFei Zhu Cc: containers, YiFei Zhu, bpf, Andrea Arcangeli, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Valentin Rothberg, Andy Lutomirski, Will Drewry, Jann Horn, Aleksa Sarai, linux-kernel On Mon, Sep 21, 2020 at 12:35:16AM -0500, YiFei Zhu wrote: > From: YiFei Zhu <yifeifz2@illinois.edu> > > This series adds a bitmap to cache seccomp filter results if the > result permits a syscall and is indepenent of syscall arguments. > This visibly decreases seccomp overhead for most common seccomp > filters with very little memory footprint. This is missing some people so expanding the Cc a little. Make sure to run scripts/get_maintainers.pl next time, in case you forgot. (Adding Andy, Will, Jann, Aleksa at least.) Christian > > The overhead of running Seccomp filters has been part of some past > discussions [1][2][3]. Oftentimes, the filters have a large number > of instructions that check syscall numbers one by one and jump based > on that. Some users chain BPF filters which further enlarge the > overhead. A recent work [6] comprehensively measures the Seccomp > overhead and shows that the overhead is non-negligible and has a > non-trivial impact on application performance. 
> > We propose SECCOMP_CACHE, a cache-based solution to minimize the > Seccomp overhead. The basic idea is to cache the result of each > syscall check to save the subsequent overhead of executing the > filters. This is feasible, because the check in Seccomp is stateless. > The checking results of the same syscall ID and argument remains > the same. > > We observed some common filters, such as docker's [4] or > systemd's [5], will make most decisions based only on the syscall > numbers, and as past discussions considered, a bitmap where each bit > represents a syscall makes most sense for these filters. > > In the past Kees proposed [2] to have an "add this syscall to the > reject bitmask". It is indeed much easier to securely make a reject > accelerator to pre-filter syscalls before passing to the BPF > filters, considering it could only strengthen the security provided > by the filter. However, ultimately, filter rejections are an > exceptional / rare case. Here, instead of accelerating what is > rejected, we accelerate what is allowed. In order not to compromise > the security rules the BPF filters defined, any accept-side > accelerator must complement the BPF filters rather than replacing them. > > Statically analyzing BPF bytecode to see if each syscall is going to > always land in allow or reject is more of a rabbit hole, especially > there is no current in-kernel infrastructure to enumerate all the > possible architecture numbers for a given machine. So rather than > doing that, we propose to cache the results after the BPF filters are > run. And since there are filters like docker's who will check > arguments of some syscalls, but not all or none of the syscalls, when > a filter is loaded we analyze it to find whether each syscall is > cacheable (does not access syscall argument or instruction pointer) by > following its control flow graph, and store the result for each filter > in a bitmap. 
Changes to architecture number or the filter are expected > to be rare and simply cause the cache to be cleared. This solution > shall be fully transparent to userspace. > > Ongoing work is to further support arguments with fast hash table > lookups. We are investigating the performance of doing so [6], and how > to best integrate with the existing seccomp infrastructure. > > We have done some benchmarks with patch applied against bpf-next > commit 2e80be60c465 ("libbpf: Fix compilation warnings for 64-bit printf args"). > > Me, in qemu-kvm x86_64 VM, on Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz, > average results: > > Without cache, seccomp_benchmark: > Current BPF sysctl settings: > net.core.bpf_jit_enable = 1 > net.core.bpf_jit_harden = 0 > Calibrating sample size for 15 seconds worth of syscalls ... > Benchmarking 23486415 syscalls... > 16.079642020 - 1.013345439 = 15066296581 (15.1s) > getpid native: 641 ns > 32.080237410 - 16.080763500 = 15999473910 (16.0s) > getpid RET_ALLOW 1 filter: 681 ns > 48.609461618 - 32.081296173 = 16528165445 (16.5s) > getpid RET_ALLOW 2 filters: 703 ns > Estimated total seccomp overhead for 1 filter: 40 ns > Estimated total seccomp overhead for 2 filters: 62 ns > Estimated seccomp per-filter overhead: 22 ns > Estimated seccomp entry overhead: 18 ns > > With cache: > Current BPF sysctl settings: > net.core.bpf_jit_enable = 1 > net.core.bpf_jit_harden = 0 > Calibrating sample size for 15 seconds worth of syscalls ... > Benchmarking 23486415 syscalls... 
> 16.059512499 - 1.014108434 = 15045404065 (15.0s) > getpid native: 640 ns > 31.651075934 - 16.060637323 = 15590438611 (15.6s) > getpid RET_ALLOW 1 filter: 663 ns > 47.367316169 - 31.652302661 = 15715013508 (15.7s) > getpid RET_ALLOW 2 filters: 669 ns > Estimated total seccomp overhead for 1 filter: 23 ns > Estimated total seccomp overhead for 2 filters: 29 ns > Estimated seccomp per-filter overhead: 6 ns > Estimated seccomp entry overhead: 17 ns > > Depending on the run estimated seccomp overhead for 2 filters can be > less than seccomp overhead for 1 filter, resulting in underflow to > estimated seccomp per-filter overhead: > Estimated total seccomp overhead for 1 filter: 27 ns > Estimated total seccomp overhead for 2 filters: 21 ns > Estimated seccomp per-filter overhead: 18446744073709551610 ns > Estimated seccomp entry overhead: 33 ns > > Jack Chen has also run some benchmarks on a bare metal > Intel(R) Xeon(R) CPU E3-1240 v3 @ 3.40GHz, with side channel > mitigations off (spec_store_bypass_disable=off spectre_v2=off mds=off > pti=off l1tf=off), with BPF JIT on and docker default profile, > and reported: > > unixbench syscall mix (https://github.com/kdlucas/byte-unixbench) > unconfined: 33295685 > docker default: 20661056 60% > docker default + cache: 25719937 30% > > Patch 1 introduces the static analyzer to check for a given filter, > whether the CFG loads the syscall arguments for each syscall number. > > Patch 2 implements the bitmap cache. 
> > [1] https://lore.kernel.org/linux-security-module/c22a6c3cefc2412cad00ae14c1371711@huawei.com/T/ > [2] https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/T/ > [3] https://github.com/seccomp/libseccomp/issues/116 > [4] https://github.com/moby/moby/blob/ae0ef82b90356ac613f329a8ef5ee42ca923417d/profiles/seccomp/default.json > [5] https://github.com/systemd/systemd/blob/6743a1caf4037f03dc51a1277855018e4ab61957/src/shared/seccomp-util.c#L270 > [6] Draco: Architectural and Operating System Support for System Call Security > https://tianyin.github.io/pub/draco.pdf, MICRO-53, Oct. 2020 > > YiFei Zhu (2): > seccomp/cache: Add "emulator" to check if filter is arg-dependent > seccomp/cache: Cache filter results that allow syscalls > > arch/x86/Kconfig | 27 +++ > include/linux/seccomp.h | 22 +++ > kernel/seccomp.c | 400 +++++++++++++++++++++++++++++++++++++++- > 3 files changed, 446 insertions(+), 3 deletions(-) > > -- > 2.28.0 ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls 2020-09-21 8:30 ` Christian Brauner @ 2020-09-21 8:44 ` YiFei Zhu 0 siblings, 0 replies; 149+ messages in thread From: YiFei Zhu @ 2020-09-21 8:44 UTC (permalink / raw) To: Christian Brauner Cc: Linux Containers, YiFei Zhu, bpf, Andrea Arcangeli, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Valentin Rothberg, Andy Lutomirski, Will Drewry, Jann Horn, Aleksa Sarai, linux-kernel On Mon, Sep 21, 2020 at 3:30 AM Christian Brauner <christian.brauner@ubuntu.com> wrote: > This is missing some people so expanding the Cc a little. Make sure to > run scripts/get_maintainers.pl next time, in case you forgot. (Adding > Andy, Will, Jann, Aleksa at least.) > > Christian ok noted. Thanks! YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls 2020-09-21 5:35 [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls YiFei Zhu ` (3 preceding siblings ...) 2020-09-21 8:30 ` Christian Brauner @ 2020-09-21 13:51 ` Tycho Andersen 2020-09-21 15:27 ` YiFei Zhu 2020-09-21 19:16 ` Jann Horn ` (3 subsequent siblings) 8 siblings, 1 reply; 149+ messages in thread From: Tycho Andersen @ 2020-09-21 13:51 UTC (permalink / raw) To: YiFei Zhu Cc: containers, Andrea Arcangeli, Giuseppe Scrivano, Kees Cook, YiFei Zhu, Tobin Feldman-Fitzthum, Dimitrios Skarlatos, Valentin Rothberg, Hubertus Franke, Jack Chen, Josep Torrellas, bpf, Tianyin Xu On Mon, Sep 21, 2020 at 12:35:16AM -0500, YiFei Zhu wrote: > From: YiFei Zhu <yifeifz2@illinois.edu> > > This series adds a bitmap to cache seccomp filter results if the > result permits a syscall and is indepenent of syscall arguments. > This visibly decreases seccomp overhead for most common seccomp > filters with very little memory footprint. > > The overhead of running Seccomp filters has been part of some past > discussions [1][2][3]. Oftentimes, the filters have a large number > of instructions that check syscall numbers one by one and jump based > on that. Some users chain BPF filters which further enlarge the > overhead. A recent work [6] comprehensively measures the Seccomp > overhead and shows that the overhead is non-negligible and has a > non-trivial impact on application performance. > > We propose SECCOMP_CACHE, a cache-based solution to minimize the > Seccomp overhead. The basic idea is to cache the result of each > syscall check to save the subsequent overhead of executing the > filters. This is feasible, because the check in Seccomp is stateless. > The checking results of the same syscall ID and argument remains > the same. 
> > We observed some common filters, such as docker's [4] or > systemd's [5], will make most decisions based only on the syscall > numbers, and as past discussions considered, a bitmap where each bit > represents a syscall makes most sense for these filters. One problem with a kernel config setting is that it's for all tasks. While docker and systemd may make decisions based on syscall number, other applications may have more nuanced filters, and this cache would yield incorrect results. You could work around this by making this a filter flag instead; filter authors would generally know whether their filter results can be cached and probably be motivated to opt in if their users are complaining about slow syscall execution. Tycho ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls 2020-09-21 13:51 ` Tycho Andersen @ 2020-09-21 15:27 ` YiFei Zhu 2020-09-21 16:39 ` Tycho Andersen 0 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-09-21 15:27 UTC (permalink / raw) To: Tycho Andersen Cc: Linux Containers, Andrea Arcangeli, Giuseppe Scrivano, Kees Cook, YiFei Zhu, Tobin Feldman-Fitzthum, Dimitrios Skarlatos, Valentin Rothberg, Hubertus Franke, Jack Chen, Josep Torrellas, bpf, Tianyin Xu, Andy Lutomirski, Will Drewry, Jann Horn, Aleksa Sarai, linux-kernel On Mon, Sep 21, 2020 at 8:51 AM Tycho Andersen <tycho@tycho.pizza> wrote: > One problem with a kernel config setting is that it's for all tasks. > While docker and systemd may make decisions based on syscall number, > other applications may have more nuanced filters, and this cache would > yield incorrect results. > > You could work around this by making this a filter flag instead; > filter authors would generally know whether their filter results can > be cached and probably be motivated to opt in if their users are > complaining about slow syscall execution. > > Tycho Yielding incorrect results should not be possible. The purpose of the "emulator" (for lack of a better term) is to determine whether the filter reads any syscall arguments. A read from a syscall argument must go through the BPF_LD | BPF_ABS instruction, where the 32 bit multiuse field "k" is an offset to struct seccomp_data. struct seccomp_data contains four components [1]: syscall number, architecture number, instruction pointer at the time of syscall, and syscall arguments. The syscall number is enumerated by the emulator. The arch number is treated by the cache as 'if arch number is different from cached arch number, flush cache' (this is in seccomp_cache_check). 
The last two (ip and args) are treated exactly the same way in this patch: if the filter loads the arguments at all, the syscall is marked non-cacheable for any architecture number. The struct seccomp_data is the only external thing the filter may access. It is also cBPF so it cannot contain maps to store special states between runs. Therefore a seccomp filter is a pure function. If we know that, given some inputs (the syscall number and arch number), the function will not evaluate any other inputs before returning, then we can safely cache with just the inputs in question. As for the overhead, on my x86_64 with gcc 10.2.0, seccomp_cache_check compiles into: if (unlikely(syscall_nr < 0 || syscall_nr >= NR_syscalls)) return false; 0xffffffff8120fdb3 <+99>: movsxd rdx,DWORD PTR [r12] 0xffffffff8120fdb7 <+103>: cmp edx,0x1b7 0xffffffff8120fdbd <+109>: ja 0xffffffff8120fdf9 <__seccomp_filter+169> if (unlikely(thread_data->last_filter != sfilter || thread_data->last_arch != sd->arch)) { 0xffffffff8120fdbf <+111>: mov rdi,QWORD PTR [rbp-0xb8] 0xffffffff8120fdc6 <+118>: lea rsi,[rax+0x6f0] 0xffffffff8120fdcd <+125>: cmp rdi,QWORD PTR [rax+0x728] 0xffffffff8120fdd4 <+132>: jne 0xffffffff812101f0 <__seccomp_filter+1184> 0xffffffff8120fdda <+138>: mov ebx,DWORD PTR [r12+0x4] 0xffffffff8120fddf <+143>: cmp DWORD PTR [rax+0x730],ebx 0xffffffff8120fde5 <+149>: jne 0xffffffff812101f0 <__seccomp_filter+1184> return test_bit(syscall_nr, thread_data->syscall_ok); 0xffffffff8120fdeb <+155>: bt QWORD PTR [rax+0x6f0],rdx 0xffffffff8120fdf3 <+163>: jb 0xffffffff8120ffb7 <__seccomp_filter+615> [... 
unlikely path of cache flush omitted] and seccomp_cache_insert compiles into: if (unlikely(syscall_nr < 0 || syscall_nr >= NR_syscalls)) return; 0xffffffff8121021b <+1227>: movsxd rax,DWORD PTR [r12] 0xffffffff8121021f <+1231>: cmp eax,0x1b7 0xffffffff81210224 <+1236>: ja 0xffffffff8120ffb7 <__seccomp_filter+615> if (!test_bit(syscall_nr, sfilter->cache.syscall_ok)) return; 0xffffffff8121022a <+1242>: mov rbx,QWORD PTR [rbp-0xb8] 0xffffffff81210231 <+1249>: mov rdx,QWORD PTR gs:0x17000 0xffffffff8121023a <+1258>: bt QWORD PTR [rbx+0x108],rax 0xffffffff81210242 <+1266>: jae 0xffffffff8120ffb7 <__seccomp_filter+615> set_bit(syscall_nr, thread_data->syscall_ok); 0xffffffff81210248 <+1272>: lock bts QWORD PTR [rdx+0x6f0],rax 0xffffffff81210251 <+1281>: jmp 0xffffffff8120ffb7 <__seccomp_filter+615> In the circumstance of a non-cacheable syscall happening over and over, the code path would go through the syscall_nr bound check, then the filter flush check, then the test_bit, then another syscall_nr bound check and one more test_bit in seccomp_cache_insert. Considering that they are either stack variables, elements of current task_struct, or elements of the filter struct, I imagine they would likely be in the CPU data cache and not incur much overhead. The CPU is also free to branch predict and reorder memory accesses (there are no hardware memory barriers here) to further increase the efficiency, whereas a normal filter execution would be impaired by things like retpoline. If one were to add an additional flag for does-userspace-want-us-to-cache, it would still be a member of the filter struct. What would be loaded into the CPU data cache originally would still be loaded. Correct me if I'm wrong, but I don't think that check will reduce any significant overhead of the seccomp cache itself. That said, I have not profiled the exact overhead this patch would add to uncacheable syscalls; I can report back with numbers if you would like to see. 
Does that answer your concern? YiFei Zhu [1] https://elixir.bootlin.com/linux/latest/source/include/uapi/linux/seccomp.h#L60 ^ permalink raw reply [flat|nested] 149+ messages in thread
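The argument-independence check YiFei describes above — every read of struct seccomp_data goes through a BPF_LD | BPF_ABS instruction whose "k" field is the offset — can be sketched in userspace C. This is a deliberately conservative linear scan for illustration only: the actual patch walks the filter's control flow graph per (arch, nr) pair, and the struct and macro names below mirror but do not reproduce the kernel's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified cBPF opcode fields, mirroring <linux/bpf_common.h>. */
#define BPF_LD   0x00
#define BPF_W    0x00
#define BPF_ABS  0x20
#define BPF_CLASS(code) ((code) & 0x07)
#define BPF_MODE(code)  ((code) & 0xe0)

/* One cBPF instruction, as in <linux/filter.h>. */
struct sock_filter {
	uint16_t code;
	uint8_t  jt, jf;
	uint32_t k;	/* for LD|ABS: offset into struct seccomp_data */
};

/* Offsets within struct seccomp_data (uapi/linux/seccomp.h):
 * nr at 0, arch at 4, instruction_pointer at 8, args at 16+. */
#define SD_NR_OFF   0
#define SD_ARCH_OFF 4

/*
 * Conservative check: the filter is argument-independent if every
 * absolute load touches only the nr or arch members.  The real patch
 * follows the control flow graph instead; this linear scan merely
 * over-approximates the same idea.
 */
static bool filter_is_arg_independent(const struct sock_filter *insns,
				      size_t len)
{
	for (size_t i = 0; i < len; i++) {
		const struct sock_filter *fp = &insns[i];

		if (BPF_CLASS(fp->code) == BPF_LD &&
		    BPF_MODE(fp->code) == BPF_ABS &&
		    fp->k != SD_NR_OFF && fp->k != SD_ARCH_OFF)
			return false;	/* reads ip or a syscall argument */
	}
	return true;
}
```

A filter that only loads the syscall number passes the check; one that loads offset 16 (the first argument) is marked non-cacheable, matching the "safe because it can only fall back to running the BPF" reasoning above.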
* Re: [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls 2020-09-21 15:27 ` YiFei Zhu @ 2020-09-21 16:39 ` Tycho Andersen 2020-09-21 22:57 ` YiFei Zhu 0 siblings, 1 reply; 149+ messages in thread From: Tycho Andersen @ 2020-09-21 16:39 UTC (permalink / raw) To: YiFei Zhu Cc: Linux Containers, Andrea Arcangeli, Giuseppe Scrivano, Kees Cook, YiFei Zhu, Tobin Feldman-Fitzthum, Dimitrios Skarlatos, Valentin Rothberg, Hubertus Franke, Jack Chen, Josep Torrellas, bpf, Tianyin Xu, Andy Lutomirski, Will Drewry, Jann Horn, Aleksa Sarai, linux-kernel On Mon, Sep 21, 2020 at 10:27:56AM -0500, YiFei Zhu wrote: > On Mon, Sep 21, 2020 at 8:51 AM Tycho Andersen <tycho@tycho.pizza> wrote: > > One problem with a kernel config setting is that it's for all tasks. > > While docker and systemd may make decisions based on syscall number, > > other applications may have more nuanced filters, and this cache would > > yield incorrect results. > > > > You could work around this by making this a filter flag instead; > > filter authors would generally know whether their filter results can > > be cached and probably be motivated to opt in if their users are > > complaining about slow syscall execution. > > > > Tycho > > Yielding incorrect results should not be possible. The purpose of the > "emulator" (for lack of a better term) is to determine whether the > filter reads any syscall arguments. A read from a syscall argument > must go through the BPF_LD | BPF_ABS instruction, where the 32 bit > multiuse field "k" is an offset to struct seccomp_data. I see, I missed this somehow. So is there a reason to hide this behind a config option? Isn't it just always better? Tycho ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls 2020-09-21 16:39 ` Tycho Andersen @ 2020-09-21 22:57 ` YiFei Zhu 0 siblings, 0 replies; 149+ messages in thread From: YiFei Zhu @ 2020-09-21 22:57 UTC (permalink / raw) To: Tycho Andersen Cc: Linux Containers, Andrea Arcangeli, Giuseppe Scrivano, Kees Cook, YiFei Zhu, Tobin Feldman-Fitzthum, Dimitrios Skarlatos, Valentin Rothberg, Hubertus Franke, Jack Chen, Josep Torrellas, bpf, Tianyin Xu, Andy Lutomirski, Will Drewry, Jann Horn, Aleksa Sarai, kernel list On Mon, Sep 21, 2020 at 11:39 AM Tycho Andersen <tycho@tycho.pizza> wrote: > I see, I missed this somehow. So is there a reason to hide this behind > a config option? Isn't it just always better? > > Tycho You have a good point, though I think keeping a config would allow people to "test the differences" in the unlikely case that some issue occurs. Jann pointed out that it should be on by default so I'll do that. YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls 2020-09-21 5:35 [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls YiFei Zhu ` (4 preceding siblings ...) 2020-09-21 13:51 ` Tycho Andersen @ 2020-09-21 19:16 ` Jann Horn [not found] ` <OF8837FC1A.5C0D4D64-ON852585EA.006B677F-852585EA.006BA663@notes.na.collabserv.com> 2020-09-23 19:26 ` Kees Cook ` (2 subsequent siblings) 8 siblings, 1 reply; 149+ messages in thread From: Jann Horn @ 2020-09-21 19:16 UTC (permalink / raw) To: YiFei Zhu Cc: Linux Containers, YiFei Zhu, bpf, Andrea Arcangeli, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Valentin Rothberg, Andy Lutomirski, Will Drewry, Aleksa Sarai, kernel list On Mon, Sep 21, 2020 at 7:35 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > This series adds a bitmap to cache seccomp filter results if the > result permits a syscall and is independent of syscall arguments. > This visibly decreases seccomp overhead for most common seccomp > filters with very little memory footprint. It would be really nice if, based on this, we could have a new entry in procfs that has one line per entry in each syscall table. Maybe something that looks vaguely like: X86_64 0 (read): ALLOW X86_64 1 (write): ALLOW X86_64 2 (open): ERRNO -1 X86_64 3 (close): ALLOW X86_64 4 (stat): <argument-dependent> [...] I386 0 (restart_syscall): ALLOW I386 1 (exit): ALLOW I386 2 (fork): KILL [...] This would be useful both for inspectability (at the moment it's pretty hard to figure out what seccomp rules really apply to a given task) and for testing (so that we could easily write unit tests to verify that the bitmap calculation works as expected). 
But if you don't want to implement that right now, we can do that at a later point - while it would be nice for making it easier to write tests for this functionality, I don't see it as a blocker. > The overhead of running Seccomp filters has been part of some past > discussions [1][2][3]. Oftentimes, the filters have a large number > of instructions that check syscall numbers one by one and jump based > on that. Some users chain BPF filters which further enlarge the > overhead. A recent work [6] comprehensively measures the Seccomp > overhead and shows that the overhead is non-negligible and has a > non-trivial impact on application performance. > > We propose SECCOMP_CACHE, a cache-based solution to minimize the > Seccomp overhead. The basic idea is to cache the result of each > syscall check to save the subsequent overhead of executing the > filters. This is feasible, because the check in Seccomp is stateless. > The checking results of the same syscall ID and argument remains > the same. > > We observed some common filters, such as docker's [4] or > systemd's [5], will make most decisions based only on the syscall > numbers, and as past discussions considered, a bitmap where each bit > represents a syscall makes most sense for these filters. [...] > Statically analyzing BPF bytecode to see if each syscall is going to > always land in allow or reject is more of a rabbit hole, especially > there is no current in-kernel infrastructure to enumerate all the > possible architecture numbers for a given machine. You could add that though. Or if you think that that's too much work, you could just do it for x86 and arm64 and then use a Kconfig dependency to limit this to those architectures for now. > So rather than > doing that, we propose to cache the results after the BPF filters are > run. Please don't add extra complexity just to work around a limitation in existing code if you could instead remove that limitation in existing code. 
Otherwise, code will become unnecessarily hard to understand and inefficient. You could let struct seccomp_filter contain three bitmasks - one for the "native" architecture and up to two for "compat" architectures (gated on some Kconfig flag). alpha has 1 architecture number, arc has 1 (per build config), arm has 1, arm64 has 2, c6x has 1 (per build config), csky has 1, h8300 has 1, hexagon has 1, ia64 has 1, m68k has 1, microblaze has 1, mips has 3 (per build config), nds32 has 1 (per build config), nios2 has 1, openrisc has 1, parisc has 2, powerpc has 2 (per build config), riscv has 1 (per build config), s390 has 2, sh has 1 (per build config), sparc has 2, x86 has 2, xtensa has 1. > And since there are filters like docker's who will check > arguments of some syscalls, but not all or none of the syscalls, when > a filter is loaded we analyze it to find whether each syscall is > cacheable (does not access syscall argument or instruction pointer) by > following its control flow graph, and store the result for each filter > in a bitmap. Changes to architecture number or the filter are expected > to be rare and simply cause the cache to be cleared. This solution > shall be fully transparent to userspace. Caching whether a given syscall number has fixed per-architecture results across all architectures is a pretty gross hack, please don't. > Ongoing work is to further support arguments with fast hash table > lookups. We are investigating the performance of doing so [6], and how > to best integrate with the existing seccomp infrastructure. ^ permalink raw reply [flat|nested] 149+ messages in thread
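The layout Jann suggests — per-filter allow bitmaps keyed by the small, fixed set of audit arch numbers a given kernel build can see, instead of one shared bitmap that is flushed whenever the arch changes — might look roughly like this. The struct name, the two-entry arch array, and the syscall-count constant are illustrative assumptions, not the eventual kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CACHE_NR_ARCHES   2	/* e.g. native + one compat; illustrative */
#define CACHE_NR_SYSCALLS 440	/* illustrative, not any real NR_syscalls */
#define BITS_PER_LONG (8 * sizeof(unsigned long))
#define BITMAP_LONGS(n) (((n) + BITS_PER_LONG - 1) / BITS_PER_LONG)

/* Hypothetical per-filter cache: one allow-bitmap per audit arch
 * number this build supports, so nothing needs flushing when a task
 * alternates between native and compat syscalls. */
struct seccomp_cache {
	uint32_t      arch[CACHE_NR_ARCHES];
	unsigned long allow[CACHE_NR_ARCHES][BITMAP_LONGS(CACHE_NR_SYSCALLS)];
};

/* Fast path: pick the bitmap matching sd->arch and test one bit. */
static bool cache_test(const struct seccomp_cache *c, uint32_t arch, int nr)
{
	if (nr < 0 || nr >= CACHE_NR_SYSCALLS)
		return false;
	for (int i = 0; i < CACHE_NR_ARCHES; i++)
		if (c->arch[i] == arch)
			return (c->allow[i][nr / BITS_PER_LONG] >>
				(nr % BITS_PER_LONG)) & 1;
	return false;	/* unknown arch: fall back to running the BPF */
}

/* Slow path records an argument-independent allow result. */
static void cache_set(struct seccomp_cache *c, uint32_t arch, int nr)
{
	if (nr < 0 || nr >= CACHE_NR_SYSCALLS)
		return;
	for (int i = 0; i < CACHE_NR_ARCHES; i++)
		if (c->arch[i] == arch)
			c->allow[i][nr / BITS_PER_LONG] |=
				1UL << (nr % BITS_PER_LONG);
}
```

Because a miss only means "run the BPF filter as before", the lookup can never weaken the filter; it can only skip work the filter would have repeated.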
[parent not found: <OF8837FC1A.5C0D4D64-ON852585EA.006B677F-852585EA.006BA663@notes.na.collabserv.com>]
* Re: [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls [not found] ` <OF8837FC1A.5C0D4D64-ON852585EA.006B677F-852585EA.006BA663@notes.na.collabserv.com> @ 2020-09-21 19:45 ` Jann Horn 0 siblings, 0 replies; 149+ messages in thread From: Jann Horn @ 2020-09-21 19:45 UTC (permalink / raw) To: Hubertus Franke Cc: Andrea Arcangeli, bpf, Linux Containers, Aleksa Sarai, Dimitrios Skarlatos, Giuseppe Scrivano, Jack Chen, Kees Cook, kernel list, Andy Lutomirski, Tobin Feldman-Fitzthum, Josep Torrellas, Tianyin Xu, Valentin Rothberg, Will Drewry, YiFei Zhu, YiFei Zhu On Mon, Sep 21, 2020 at 9:35 PM Hubertus Franke <frankeh@us.ibm.com> wrote: > I suggest we first bring it down to the minimal features we want and successively build the functions as these ideas evolve. > We asked YiFei to prepare a minimal set that brings home the basic features. Might not be 100% optimal but having the hooks, the basic cache in place and getting a good benefit should be a good starting point > to get this integrated into a linux kernel and then enable a larger experimentation. > Does that make sense to approach it from that point ? Sure. As I said, I don't think that the procfs part is a blocker - if YiFei doesn't want to implement it now, I don't think it's necessary. (But it would make it possible to write more precise tests.) By the way: Please don't top-post on mailing lists - instead, quote specific parts of a message and reply below those quotes. Also, don't send HTML mail to kernel mailing lists, because they will reject it. ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls 2020-09-21 5:35 [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls YiFei Zhu ` (5 preceding siblings ...) 2020-09-21 19:16 ` Jann Horn @ 2020-09-23 19:26 ` Kees Cook 2020-09-23 22:54 ` YiFei Zhu 2020-09-24 12:06 ` [PATCH seccomp 0/6] " YiFei Zhu 2020-09-30 15:19 ` [PATCH v3 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu 8 siblings, 1 reply; 149+ messages in thread From: Kees Cook @ 2020-09-23 19:26 UTC (permalink / raw) To: YiFei Zhu Cc: containers, YiFei Zhu, bpf, Andrea Arcangeli, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Valentin Rothberg On Mon, Sep 21, 2020 at 12:35:16AM -0500, YiFei Zhu wrote: > In the past Kees proposed [2] to have an "add this syscall to the > reject bitmask". It is indeed much easier to securely make a reject > accelerator to pre-filter syscalls before passing to the BPF > filters, considering it could only strengthen the security provided > by the filter. However, ultimately, filter rejections are an > exceptional / rare case. Here, instead of accelerating what is > rejected, we accelerate what is allowed. In order not to compromise > the security rules the BPF filters defined, any accept-side > accelerator must complement the BPF filters rather than replacing them. Did you see the RFC series for this? https://lore.kernel.org/lkml/20200616074934.1600036-1-keescook@chromium.org/ > Without cache, seccomp_benchmark: > Current BPF sysctl settings: > net.core.bpf_jit_enable = 1 > net.core.bpf_jit_harden = 0 > Calibrating sample size for 15 seconds worth of syscalls ... > Benchmarking 23486415 syscalls... 
> 16.079642020 - 1.013345439 = 15066296581 (15.1s) > getpid native: 641 ns > 32.080237410 - 16.080763500 = 15999473910 (16.0s) > getpid RET_ALLOW 1 filter: 681 ns > 48.609461618 - 32.081296173 = 16528165445 (16.5s) > getpid RET_ALLOW 2 filters: 703 ns > Estimated total seccomp overhead for 1 filter: 40 ns > Estimated total seccomp overhead for 2 filters: 62 ns > Estimated seccomp per-filter overhead: 22 ns > Estimated seccomp entry overhead: 18 ns > > With cache: > Current BPF sysctl settings: > net.core.bpf_jit_enable = 1 > net.core.bpf_jit_harden = 0 > Calibrating sample size for 15 seconds worth of syscalls ... > Benchmarking 23486415 syscalls... > 16.059512499 - 1.014108434 = 15045404065 (15.0s) > getpid native: 640 ns > 31.651075934 - 16.060637323 = 15590438611 (15.6s) > getpid RET_ALLOW 1 filter: 663 ns > 47.367316169 - 31.652302661 = 15715013508 (15.7s) > getpid RET_ALLOW 2 filters: 669 ns > Estimated total seccomp overhead for 1 filter: 23 ns > Estimated total seccomp overhead for 2 filters: 29 ns > Estimated seccomp per-filter overhead: 6 ns > Estimated seccomp entry overhead: 17 ns > > Depending on the run estimated seccomp overhead for 2 filters can be > less than seccomp overhead for 1 filter, resulting in underflow to > estimated seccomp per-filter overhead: > Estimated total seccomp overhead for 1 filter: 27 ns > Estimated total seccomp overhead for 2 filters: 21 ns > Estimated seccomp per-filter overhead: 18446744073709551610 ns > Estimated seccomp entry overhead: 33 ns Which also includes updated benchmarking: https://lore.kernel.org/lkml/20200616074934.1600036-6-keescook@chromium.org/ -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
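The impossible-looking per-filter estimate quoted above is plain unsigned wraparound: the benchmark subtracts the two overhead estimates in 64-bit unsigned arithmetic, so 21 - 27 comes out as 2^64 - 6 = 18446744073709551610. A minimal reproduction (the function name is illustrative, not the benchmark's):

```c
#include <assert.h>
#include <stdint.h>

/* The benchmark computes per-filter overhead as an unsigned
 * difference; when the 2-filter run happens to measure faster than
 * the 1-filter run, the difference wraps modulo 2^64 instead of
 * going negative. */
static uint64_t per_filter_overhead(uint64_t two_filters_ns,
				    uint64_t one_filter_ns)
{
	return two_filters_ns - one_filter_ns; /* wraps when "negative" */
}
```

Clamping to zero (or printing a signed difference) in the benchmark script would avoid reporting the wrapped value.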
* Re: [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls 2020-09-23 19:26 ` Kees Cook @ 2020-09-23 22:54 ` YiFei Zhu 2020-09-24 6:52 ` Kees Cook 0 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-09-23 22:54 UTC (permalink / raw) To: Kees Cook Cc: Linux Containers, YiFei Zhu, bpf, Andrea Arcangeli, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Valentin Rothberg On Wed, Sep 23, 2020 at 2:26 PM Kees Cook <keescook@chromium.org> wrote: > Did you see the RFC series for this? > > https://lore.kernel.org/lkml/20200616074934.1600036-1-keescook@chromium.org/ > [...] > Which also includes updated benchmarking: > > https://lore.kernel.org/lkml/20200616074934.1600036-6-keescook@chromium.org/ Nice. I was not aware of that series. Looking at it, it seems that our reasoning for checking arch and nr only, and verifying whether the filter accesses anything else, is the same. However, the approach that RFC used was some page table dark magic, and it has been concluded that an emulator is superior. Was there a separate patch series with the emulator? If not, would you mind me cherry-picking some of your changes in that series? Also, I see that BPF_AND is said to be used in the discussion of the linked series. I think it wouldn't hurt to emulate a few BPF_ALU so I'll add that. YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls 2020-09-23 22:54 ` YiFei Zhu @ 2020-09-24 6:52 ` Kees Cook 0 siblings, 0 replies; 149+ messages in thread From: Kees Cook @ 2020-09-24 6:52 UTC (permalink / raw) To: YiFei Zhu Cc: Linux Containers, YiFei Zhu, bpf, Andrea Arcangeli, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Valentin Rothberg On Wed, Sep 23, 2020 at 05:54:51PM -0500, YiFei Zhu wrote: > On Wed, Sep 23, 2020 at 2:26 PM Kees Cook <keescook@chromium.org> wrote: > > Did you see the RFC series for this? > > > > https://lore.kernel.org/lkml/20200616074934.1600036-1-keescook@chromium.org/ > > [...] > > Which also includes updated benchmarking: > > > > https://lore.kernel.org/lkml/20200616074934.1600036-6-keescook@chromium.org/ > > Nice. I was not aware of that series. Looking at it, it seems that our > reasoning for checking arch and nr only, and verify if the filter > accesses anything else, is the same. However, the approach in that RFC > used was some page table dark magic, and it has been concluded that an > emulator is superior. Was there a separate patch series with emulator? > If not, would you mind me cherry-picking some of your changes in that > series? I've sent that series refreshed with Jann's emulator now[1]. (Which I see you've replied to as well, but I figured I'd just link these threads for any future archaeology. ;) > Also, I see that BPF_AND is said to be used in the discussion of the > linked series. I think it wouldn't hurt to emulate a few BPF_ALU so > I'll add that. If you could add ALU|AND, that would get us complete coverage for libseccomp and Chrome. I don't want the emulator to get any more complex than that, as I view it as a fairly high-risk part. As you can see, I tried really hard to _not_ use an emulator in the RFC. 
;) [1] https://lore.kernel.org/lkml/20200923232923.3142503-1-keescook@chromium.org/ -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
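Supporting ALU|AND in the emulator, as requested here, amounts to tracking whether the accumulator still holds a known constant at each step: an AND with an immediate keeps it known, and anything unsupported makes the emulator give up (the safe answer, since "don't cache" just means the BPF runs as before). A hedged sketch — the state layout and function names are assumptions, not the merged code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* cBPF opcode pieces for BPF_ALU | BPF_AND | BPF_K. */
#define BPF_ALU 0x04
#define BPF_AND 0x50
#define BPF_K   0x00

/* Tracked accumulator during filter emulation; names illustrative. */
struct emu_state {
	uint32_t acc;
	bool     acc_known;	/* is acc a known constant at this point? */
};

/*
 * Handle one ALU instruction.  Returns true when emulation can
 * continue with a known state; false means "give up, don't cache",
 * which is always safe because the real filter still runs.
 */
static bool emu_alu_step(struct emu_state *st, uint16_t code, uint32_t k)
{
	if (code == (BPF_ALU | BPF_AND | BPF_K)) {
		if (!st->acc_known)
			return false;
		st->acc &= k;	/* constant in, constant out */
		return true;
	}
	return false;		/* any other ALU op: bail out */
}
```

This mirrors the "no more complex than ALU|AND" constraint Kees states: every other opcode falls through to the conservative bail-out path.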
* [PATCH seccomp 0/6] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls 2020-09-21 5:35 [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls YiFei Zhu ` (6 preceding siblings ...) 2020-09-23 19:26 ` Kees Cook @ 2020-09-24 12:06 ` YiFei Zhu 2020-09-24 12:06 ` [PATCH seccomp 1/6] seccomp: Move config option SECCOMP to arch/Kconfig YiFei Zhu ` (6 more replies) 2020-09-30 15:19 ` [PATCH v3 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu 8 siblings, 7 replies; 149+ messages in thread From: YiFei Zhu @ 2020-09-24 12:06 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: YiFei Zhu <yifeifz2@illinois.edu> Alternative: https://lore.kernel.org/lkml/20200923232923.3142503-1-keescook@chromium.org/T/ Major differences from the linked alternative by Kees: * No x32 special-case handling -- not worth the complexity * No caching of denylist -- not worth the complexity * No seccomp arch pinning -- I think this is an independent feature * The bitmaps are part of the filters rather than the task. * Architectures supported by default through arch number array, except for MIPS with its sparse syscall numbers. * Configurable per-build for future different cache modes. This series adds a bitmap to cache seccomp filter results if the result permits a syscall and is independent of syscall arguments. This visibly decreases seccomp overhead for most common seccomp filters with very little memory footprint. The overhead of running Seccomp filters has been part of some past discussions [1][2][3]. Oftentimes, the filters have a large number of instructions that check syscall numbers one by one and jump based on that. 
Some users chain BPF filters which further enlarge the overhead. A recent work [6] comprehensively measures the Seccomp overhead and shows that the overhead is non-negligible and has a non-trivial impact on application performance. We observed that some common filters, such as docker's [4] or systemd's [5], will make most decisions based only on the syscall numbers, and as past discussions considered, a bitmap where each bit represents a syscall makes most sense for these filters. In order to build this bitmap at filter attach time, each filter is emulated for every syscall (under each possible architecture), and checked for any accesses of struct seccomp_data other than the "arch" and "nr" (syscall) members. If only "arch" and "nr" are examined, and the program returns allow, then we can be sure that the filter must return allow independent of syscall arguments. When it is concluded that an allow must occur for the given architecture and syscall pair, seccomp will immediately allow the syscall, bypassing further BPF execution. Ongoing work is to further support arguments with fast hash table lookups. We are investigating the performance of doing so [6], and how to best integrate with the existing seccomp infrastructure. Some benchmarks are performed with results in patch 5, copied below: Current BPF sysctl settings: net.core.bpf_jit_enable = 1 net.core.bpf_jit_harden = 0 Benchmarking 100000000 syscalls... 
63.896255358 - 0.008504529 = 63887750829 (63.9s) getpid native: 638 ns 130.383312423 - 63.897315189 = 66485997234 (66.5s) getpid RET_ALLOW 1 filter (bitmap): 664 ns 196.789080421 - 130.384414983 = 66404665438 (66.4s) getpid RET_ALLOW 2 filters (bitmap): 664 ns 268.844643304 - 196.790234168 = 72054409136 (72.1s) getpid RET_ALLOW 3 filters (full): 720 ns 342.627472515 - 268.845799103 = 73781673412 (73.8s) getpid RET_ALLOW 4 filters (full): 737 ns Estimated total seccomp overhead for 1 bitmapped filter: 26 ns Estimated total seccomp overhead for 2 bitmapped filters: 26 ns Estimated total seccomp overhead for 3 full filters: 82 ns Estimated total seccomp overhead for 4 full filters: 99 ns Estimated seccomp entry overhead: 26 ns Estimated seccomp per-filter overhead (last 2 diff): 17 ns Estimated seccomp per-filter overhead (filters / 4): 18 ns Expectations: native ≤ 1 bitmap (638 ≤ 664): ✔️ native ≤ 1 filter (638 ≤ 720): ✔️ per-filter (last 2 diff) ≈ per-filter (filters / 4) (17 ≈ 18): ✔️ 1 bitmapped ≈ 2 bitmapped (26 ≈ 26): ✔️ entry ≈ 1 bitmapped (26 ≈ 26): ✔️ entry ≈ 2 bitmapped (26 ≈ 26): ✔️ native + entry + (per filter * 4) ≈ 4 filters total (732 ≈ 737): ✔️ RFC -> v1: * Config made on by default across all arches that could support it. * Added arch numbers array and emulate filter for each arch number, and have a per-arch bitmap. * Massively simplified the emulator so it would only support the common instructions in Kees's list. * Fixed inheriting bitmap across filters (filter->prev is always NULL during prepare). * Stole the selftest from Kees. * Added a /proc/pid/seccomp_cache by Jann's suggestion. Patch 1 moves the SECCOMP Kcomfig option to arch/Kconfig. Patch 2 adds a syscall_arches array so the emulator can enumerate it. Patch 3 implements the emulator that finds if a filter must return allow, Patch 4 implements the test_bit against the bitmaps. Patch 5 updates the selftest to better show the new semantics. Patch 6 implements /proc/pid/seccomp_cache. 
[1] https://lore.kernel.org/linux-security-module/c22a6c3cefc2412cad00ae14c1371711@huawei.com/T/ [2] https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/T/ [3] https://github.com/seccomp/libseccomp/issues/116 [4] https://github.com/moby/moby/blob/ae0ef82b90356ac613f329a8ef5ee42ca923417d/profiles/seccomp/default.json [5] https://github.com/systemd/systemd/blob/6743a1caf4037f03dc51a1277855018e4ab61957/src/shared/seccomp-util.c#L270 [6] Draco: Architectural and Operating System Support for System Call Security https://tianyin.github.io/pub/draco.pdf, MICRO-53, Oct. 2020 Kees Cook (1): selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu (5): seccomp: Move config option SECCOMP to arch/Kconfig asm/syscall.h: Add syscall_arches[] array seccomp/cache: Add "emulator" to check if filter is arg-dependent seccomp/cache: Lookup syscall allowlist for fast path seccomp/cache: Report cache data through /proc/pid/seccomp_cache arch/Kconfig | 56 ++++ arch/alpha/include/asm/syscall.h | 4 + arch/arc/include/asm/syscall.h | 24 +- arch/arm/Kconfig | 15 +- arch/arm/include/asm/syscall.h | 4 + arch/arm64/Kconfig | 13 - arch/arm64/include/asm/syscall.h | 4 + arch/c6x/include/asm/syscall.h | 13 +- arch/csky/Kconfig | 13 - arch/csky/include/asm/syscall.h | 4 + arch/h8300/include/asm/syscall.h | 4 + arch/hexagon/include/asm/syscall.h | 4 + arch/ia64/include/asm/syscall.h | 4 + arch/m68k/include/asm/syscall.h | 4 + arch/microblaze/Kconfig | 18 +- arch/microblaze/include/asm/syscall.h | 4 + arch/mips/Kconfig | 17 -- arch/mips/include/asm/syscall.h | 16 ++ arch/nds32/include/asm/syscall.h | 13 +- arch/nios2/include/asm/syscall.h | 4 + arch/openrisc/include/asm/syscall.h | 4 + arch/parisc/Kconfig | 16 -- arch/parisc/include/asm/syscall.h | 7 + arch/powerpc/Kconfig | 17 -- arch/powerpc/include/asm/syscall.h | 14 + arch/riscv/Kconfig | 13 - arch/riscv/include/asm/syscall.h | 14 +- arch/s390/Kconfig | 17 -- arch/s390/include/asm/syscall.h | 7 + arch/sh/Kconfig | 16 -- 
arch/sh/include/asm/syscall_32.h | 17 +- arch/sparc/Kconfig | 18 +- arch/sparc/include/asm/syscall.h | 9 + arch/um/Kconfig | 16 -- arch/x86/Kconfig | 16 -- arch/x86/include/asm/syscall.h | 11 + arch/x86/um/asm/syscall.h | 14 +- arch/xtensa/Kconfig | 14 - arch/xtensa/include/asm/syscall.h | 4 + fs/proc/base.c | 7 +- include/linux/seccomp.h | 5 + kernel/seccomp.c | 259 +++++++++++++++++- .../selftests/seccomp/seccomp_benchmark.c | 151 ++++++++-- tools/testing/selftests/seccomp/settings | 2 +- 44 files changed, 641 insertions(+), 265 deletions(-) -- 2.28.0 ^ permalink raw reply [flat|nested] 149+ messages in thread
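The attach-time pass this cover letter describes — emulate the filter once per (arch, nr) pair and record provable constant-allow results in per-arch bitmaps — can be sketched as follows. The constants, the struct, and the emulate callback are illustrative stand-ins, with a toy emulator in place of the real cBPF walker:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NR_ARCHES   2	/* e.g. native + compat; illustrative */
#define NR_SYSCALLS 64	/* shrunk for the sketch */

/* Signature of the emulator: true iff the filter provably returns
 * SECCOMP_RET_ALLOW for this (arch, nr) while reading only the nr
 * and arch members of seccomp_data. */
typedef bool (*emulate_fn)(uint32_t arch, int nr);

/* One bit per syscall, one row per supported arch. */
struct filter_bitmap {
	uint64_t allow[NR_ARCHES][NR_SYSCALLS / 64];
};

/* Attach-time pass: enumerate every (arch, nr) pair once so the
 * fast path never reruns the BPF for constant-allow results. */
static void build_bitmap(struct filter_bitmap *bm,
			 const uint32_t *arches, emulate_fn emulate)
{
	memset(bm, 0, sizeof(*bm));
	for (int a = 0; a < NR_ARCHES; a++)
		for (int nr = 0; nr < NR_SYSCALLS; nr++)
			if (emulate(arches[a], nr))
				bm->allow[a][nr / 64] |= 1ULL << (nr % 64);
}

/* Toy stand-in for the real cBPF emulator: pretend the filter
 * constant-allows even-numbered syscalls on arch 1 only. */
static bool toy_emulate(uint32_t arch, int nr)
{
	return arch == 1 && nr % 2 == 0;
}
```

Because the pass runs once at attach time and filters are immutable afterwards, the bitmap never needs invalidation; a cleared bit simply means the syscall takes the normal BPF path.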
* [PATCH seccomp 1/6] seccomp: Move config option SECCOMP to arch/Kconfig 2020-09-24 12:06 ` [PATCH seccomp 0/6] " YiFei Zhu @ 2020-09-24 12:06 ` YiFei Zhu 2020-09-24 12:06 ` YiFei Zhu 2020-09-24 12:06 ` [PATCH seccomp 2/6] asm/syscall.h: Add syscall_arches[] array YiFei Zhu ` (5 subsequent siblings) 6 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-09-24 12:06 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: YiFei Zhu <yifeifz2@illinois.edu> In order to make adding configurable features into seccomp easier, it's better to have the options in one single location, especially considering that the bulk of seccomp code is arch-independent. A quick look also shows that many SECCOMP descriptions are outdated; they talk about /proc rather than prctl. Architectures arm, arm64, csky, riscv, sh, and xtensa did not have SECCOMP on by default prior to this; as a result of moving the config option and keeping it default on, SECCOMP is now on by default for them. Architectures microblaze, mips, powerpc, s390, sh, and sparc have an outdated depend on PROC_FS and this dependency is removed in this change. 
Suggested-by: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/lkml/CAG48ez1YWz9cnp08UZgeieYRhHdqh-ch7aNwc4JRBnGyrmgfMg@mail.gmail.com/
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
---
 arch/Kconfig            | 21 +++++++++++++++++++++
 arch/arm/Kconfig        | 15 +--------------
 arch/arm64/Kconfig      | 13 -------------
 arch/csky/Kconfig       | 13 -------------
 arch/microblaze/Kconfig | 18 +-----------------
 arch/mips/Kconfig       | 17 -----------------
 arch/parisc/Kconfig     | 16 ----------------
 arch/powerpc/Kconfig    | 17 -----------------
 arch/riscv/Kconfig      | 13 -------------
 arch/s390/Kconfig       | 17 -----------------
 arch/sh/Kconfig         | 16 ----------------
 arch/sparc/Kconfig      | 18 +-----------------
 arch/um/Kconfig         | 16 ----------------
 arch/x86/Kconfig        | 16 ----------------
 arch/xtensa/Kconfig     | 14 --------------
 15 files changed, 24 insertions(+), 216 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index af14a567b493..6dfc5673215d 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -444,8 +444,12 @@ config ARCH_WANT_OLD_COMPAT_IPC
 	select ARCH_WANT_COMPAT_IPC_PARSE_VERSION
 	bool
 
+config HAVE_ARCH_SECCOMP
+	bool
+
 config HAVE_ARCH_SECCOMP_FILTER
 	bool
+	select HAVE_ARCH_SECCOMP
 	help
 	  An arch should select this symbol if it provides all of these things:
 	  - syscall_get_arch()
@@ -458,6 +462,23 @@ config HAVE_ARCH_SECCOMP_FILTER
 	    results in the system call being skipped immediately.
 	  - seccomp syscall wired up
 
+config SECCOMP
+	def_bool y
+	depends on HAVE_ARCH_SECCOMP
+	prompt "Enable seccomp to safely compute untrusted bytecode"
+	help
+	  This kernel feature is useful for number crunching applications
+	  that may need to compute untrusted bytecode during their
+	  execution. By using pipes or other transports made available to
+	  the process as file descriptors supporting the read/write
+	  syscalls, it's possible to isolate those applications in
+	  their own address space using seccomp. Once seccomp is
+	  enabled via prctl(PR_SET_SECCOMP), it cannot be disabled
+	  and the task is only allowed to execute a few safe syscalls
+	  defined by each seccomp mode.
+
+	  If unsure, say Y. Only embedded should say N here.
+
 config SECCOMP_FILTER
 	def_bool y
 	depends on HAVE_ARCH_SECCOMP_FILTER && SECCOMP && NET
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index e00d94b16658..e26c19a16284 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -67,6 +67,7 @@ config ARM
 	select HAVE_ARCH_JUMP_LABEL if !XIP_KERNEL && !CPU_ENDIAN_BE32 && MMU
 	select HAVE_ARCH_KGDB if !CPU_ENDIAN_BE32 && MMU
 	select HAVE_ARCH_MMAP_RND_BITS if MMU
+	select HAVE_ARCH_SECCOMP
 	select HAVE_ARCH_SECCOMP_FILTER if AEABI && !OABI_COMPAT
 	select HAVE_ARCH_THREAD_STRUCT_WHITELIST
 	select HAVE_ARCH_TRACEHOOK
@@ -1617,20 +1618,6 @@ config UACCESS_WITH_MEMCPY
 	  However, if the CPU data cache is using a write-allocate mode, this
 	  option is unlikely to provide any performance gain.
 
-config SECCOMP
-	bool
-	prompt "Enable seccomp to safely compute untrusted bytecode"
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via prctl(PR_SET_SECCOMP), it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
-
 config PARAVIRT
 	bool "Enable paravirtualization code"
 	help
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 6d232837cbee..98c4e34cbec1 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1033,19 +1033,6 @@ config ARCH_ENABLE_SPLIT_PMD_PTLOCK
 config CC_HAVE_SHADOW_CALL_STACK
 	def_bool $(cc-option, -fsanitize=shadow-call-stack -ffixed-x18)
 
-config SECCOMP
-	bool "Enable seccomp to safely compute untrusted bytecode"
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via prctl(PR_SET_SECCOMP), it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
-
 config PARAVIRT
 	bool "Enable paravirtualization code"
 	help
diff --git a/arch/csky/Kconfig b/arch/csky/Kconfig
index 3d5afb5f5685..7f424c85772c 100644
--- a/arch/csky/Kconfig
+++ b/arch/csky/Kconfig
@@ -309,16 +309,3 @@ endmenu
 source "arch/csky/Kconfig.platforms"
 
 source "kernel/Kconfig.hz"
-
-config SECCOMP
-	bool "Enable seccomp to safely compute untrusted bytecode"
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via prctl(PR_SET_SECCOMP), it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
diff --git a/arch/microblaze/Kconfig b/arch/microblaze/Kconfig
index d262ac0c8714..37bd6a5f38fb 100644
--- a/arch/microblaze/Kconfig
+++ b/arch/microblaze/Kconfig
@@ -26,6 +26,7 @@ config MICROBLAZE
 	select GENERIC_SCHED_CLOCK
 	select HAVE_ARCH_HASH
 	select HAVE_ARCH_KGDB
+	select HAVE_ARCH_SECCOMP
 	select HAVE_DEBUG_KMEMLEAK
 	select HAVE_DMA_CONTIGUOUS
 	select HAVE_DYNAMIC_FTRACE
@@ -120,23 +121,6 @@ config CMDLINE_FORCE
 	  Set this to have arguments from the default kernel command string
 	  override those passed by the boot loader.
 
-config SECCOMP
-	bool "Enable seccomp to safely compute untrusted bytecode"
-	depends on PROC_FS
-	default y
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via /proc/<pid>/seccomp, it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
-
-	  If unsure, say Y. Only embedded should say N here.
-
 endmenu
 
 menu "Kernel features"
diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index c95fa3a2484c..5f88a8fc11fc 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -3004,23 +3004,6 @@ config PHYSICAL_START
 	  specified in the "crashkernel=YM@XM" command line boot parameter
 	  passed to the panic-ed kernel).
 
-config SECCOMP
-	bool "Enable seccomp to safely compute untrusted bytecode"
-	depends on PROC_FS
-	default y
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via /proc/<pid>/seccomp, it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
-
-	  If unsure, say Y. Only embedded should say N here.
-
 config MIPS_O32_FP64_SUPPORT
 	bool "Support for O32 binaries using 64-bit FP" if !CPU_MIPSR6
 	depends on 32BIT || MIPS32_O32
diff --git a/arch/parisc/Kconfig b/arch/parisc/Kconfig
index 3b0f53dd70bc..cd4afe1e7a6c 100644
--- a/arch/parisc/Kconfig
+++ b/arch/parisc/Kconfig
@@ -378,19 +378,3 @@ endmenu
 
 source "drivers/parisc/Kconfig"
-
-config SECCOMP
-	def_bool y
-	prompt "Enable seccomp to safely compute untrusted bytecode"
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via prctl(PR_SET_SECCOMP), it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
-
-	  If unsure, say Y. Only embedded should say N here.
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 1f48bbfb3ce9..136fe860caef 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -934,23 +934,6 @@ config ARCH_WANTS_FREEZER_CONTROL
 
 source "kernel/power/Kconfig"
 
-config SECCOMP
-	bool "Enable seccomp to safely compute untrusted bytecode"
-	depends on PROC_FS
-	default y
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via /proc/<pid>/seccomp, it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
-
-	  If unsure, say Y. Only embedded should say N here.
-
 config PPC_MEM_KEYS
 	prompt "PowerPC Memory Protection Keys"
 	def_bool y
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index df18372861d8..c456b558fab9 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -333,19 +333,6 @@ menu "Kernel features"
 
 source "kernel/Kconfig.hz"
 
-config SECCOMP
-	bool "Enable seccomp to safely compute untrusted bytecode"
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via prctl(PR_SET_SECCOMP), it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
-
 config RISCV_SBI_V01
 	bool "SBI v0.1 support"
 	default y
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 3d86e12e8e3c..7f7b40ec699e 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -791,23 +791,6 @@ config CRASH_DUMP
 
 endmenu
 
-config SECCOMP
-	def_bool y
-	prompt "Enable seccomp to safely compute untrusted bytecode"
-	depends on PROC_FS
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via /proc/<pid>/seccomp, it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
-
-	  If unsure, say Y.
-
 config CCW
 	def_bool y
diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig
index d20927128fce..18278152c91c 100644
--- a/arch/sh/Kconfig
+++ b/arch/sh/Kconfig
@@ -600,22 +600,6 @@ config PHYSICAL_START
 	  where the fail safe kernel needs to run at a different address
 	  than the panic-ed kernel.
 
-config SECCOMP
-	bool "Enable seccomp to safely compute untrusted bytecode"
-	depends on PROC_FS
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via prctl, it cannot be disabled and the task is only
-	  allowed to execute a few safe syscalls defined by each seccomp
-	  mode.
-
-	  If unsure, say N.
-
 config SMP
 	bool "Symmetric multi-processing support"
 	depends on SYS_SUPPORTS_SMP
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index efeff2c896a5..d62ce83cf009 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -23,6 +23,7 @@ config SPARC
 	select HAVE_OPROFILE
 	select HAVE_ARCH_KGDB if !SMP || SPARC64
 	select HAVE_ARCH_TRACEHOOK
+	select HAVE_ARCH_SECCOMP if SPARC64
 	select HAVE_EXIT_THREAD
 	select HAVE_PCI
 	select SYSCTL_EXCEPTION_TRACE
@@ -226,23 +227,6 @@ config EARLYFB
 	help
 	  Say Y here to enable a faster early framebuffer boot console.
 
-config SECCOMP
-	bool "Enable seccomp to safely compute untrusted bytecode"
-	depends on SPARC64 && PROC_FS
-	default y
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via /proc/<pid>/seccomp, it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
-
-	  If unsure, say Y. Only embedded should say N here.
-
 config HOTPLUG_CPU
 	bool "Support for hot-pluggable CPUs"
 	depends on SPARC64 && SMP
diff --git a/arch/um/Kconfig b/arch/um/Kconfig
index eb51fec75948..d49f471b02e3 100644
--- a/arch/um/Kconfig
+++ b/arch/um/Kconfig
@@ -173,22 +173,6 @@ config PGTABLE_LEVELS
 	default 3 if 3_LEVEL_PGTABLES
 	default 2
 
-config SECCOMP
-	def_bool y
-	prompt "Enable seccomp to safely compute untrusted bytecode"
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via prctl(PR_SET_SECCOMP), it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
-
-	  If unsure, say Y.
-
 config UML_TIME_TRAVEL_SUPPORT
 	bool
 	prompt "Support time-travel mode (e.g. for test execution)"
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7101ac64bb20..1ab22869a765 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1968,22 +1968,6 @@ config EFI_MIXED
 	  If unsure, say N.
 
-config SECCOMP
-	def_bool y
-	prompt "Enable seccomp to safely compute untrusted bytecode"
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via prctl(PR_SET_SECCOMP), it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
-
-	  If unsure, say Y. Only embedded should say N here.
-
 source "kernel/Kconfig.hz"
 
 config KEXEC
diff --git a/arch/xtensa/Kconfig b/arch/xtensa/Kconfig
index e997e0119c02..d8a29dc5a284 100644
--- a/arch/xtensa/Kconfig
+++ b/arch/xtensa/Kconfig
@@ -217,20 +217,6 @@ config HOTPLUG_CPU
 	  Say N if you want to disable CPU hotplug.
 
-config SECCOMP
-	bool
-	prompt "Enable seccomp to safely compute untrusted bytecode"
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via prctl(PR_SET_SECCOMP), it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
-
 config FAST_SYSCALL_XTENSA
 	bool "Enable fast atomic syscalls"
 	default n
-- 
2.28.0
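With the SECCOMP option centralized in arch/Kconfig, an architecture opts in with a single select line instead of carrying its own copy of the option. A hypothetical new architecture's entry might look like the following (illustrative fragment only; NEWARCH is not a real config symbol from this patch):

```
config NEWARCH
	bool "Example architecture"
	# One-line opt-in; the SECCOMP prompt itself now lives in
	# arch/Kconfig and defaults to y.
	select HAVE_ARCH_SECCOMP
	# Selecting filter support also implies HAVE_ARCH_SECCOMP via
	# the new "select" under HAVE_ARCH_SECCOMP_FILTER, so listing
	# both is redundant but harmless.
	select HAVE_ARCH_SECCOMP_FILTER
```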
* [PATCH seccomp 1/6] seccomp: Move config option SECCOMP to arch/Kconfig 2020-09-24 12:06 ` [PATCH seccomp 1/6] seccomp: Move config option SECCOMP to arch/Kconfig YiFei Zhu @ 2020-09-24 12:06 ` YiFei Zhu 0 siblings, 0 replies; 149+ messages in thread From: YiFei Zhu @ 2020-09-24 12:06 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: YiFei Zhu <yifeifz2@illinois.edu> In order to make adding configurable features into seccomp easier, it's better to have the options at one single location, considering easpecially that the bulk of seccomp code is arch-independent. An quick look also show that many SECCOMP descriptions are outdated; they talk about /proc rather than prctl. As a result of moving the config option and keeping it default on, architectures arm, arm64, csky, riscv, sh, and xtensa did not have SECCOMP on by default prior to this and SECCOMP will be default in this change. Architectures microblaze, mips, powerpc, s390, sh, and sparc have an outdated depend on PROC_FS and this dependency is removed in this change. 
Suggested-by: Jann Horn <jannh@google.com> Link: https://lore.kernel.org/lkml/CAG48ez1YWz9cnp08UZgeieYRhHdqh-ch7aNwc4JRBnGyrmgfMg@mail.gmail.com/ Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> --- arch/Kconfig | 21 +++++++++++++++++++++ arch/arm/Kconfig | 15 +-------------- arch/arm64/Kconfig | 13 ------------- arch/csky/Kconfig | 13 ------------- arch/microblaze/Kconfig | 18 +----------------- arch/mips/Kconfig | 17 ----------------- arch/parisc/Kconfig | 16 ---------------- arch/powerpc/Kconfig | 17 ----------------- arch/riscv/Kconfig | 13 ------------- arch/s390/Kconfig | 17 ----------------- arch/sh/Kconfig | 16 ---------------- arch/sparc/Kconfig | 18 +----------------- arch/um/Kconfig | 16 ---------------- arch/x86/Kconfig | 16 ---------------- arch/xtensa/Kconfig | 14 -------------- 15 files changed, 24 insertions(+), 216 deletions(-) diff --git a/arch/Kconfig b/arch/Kconfig index af14a567b493..6dfc5673215d 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -444,8 +444,12 @@ config ARCH_WANT_OLD_COMPAT_IPC select ARCH_WANT_COMPAT_IPC_PARSE_VERSION bool +config HAVE_ARCH_SECCOMP + bool + config HAVE_ARCH_SECCOMP_FILTER bool + select HAVE_ARCH_SECCOMP help An arch should select this symbol if it provides all of these things: - syscall_get_arch() @@ -458,6 +462,23 @@ config HAVE_ARCH_SECCOMP_FILTER results in the system call being skipped immediately. - seccomp syscall wired up +config SECCOMP + def_bool y + depends on HAVE_ARCH_SECCOMP + prompt "Enable seccomp to safely compute untrusted bytecode" + help + This kernel feature is useful for number crunching applications + that may need to compute untrusted bytecode during their + execution. By using pipes or other transports made available to + the process as file descriptors supporting the read/write + syscalls, it's possible to isolate those applications in + their own address space using seccomp. 
Once seccomp is + enabled via prctl(PR_SET_SECCOMP), it cannot be disabled + and the task is only allowed to execute a few safe syscalls + defined by each seccomp mode. + + If unsure, say Y. Only embedded should say N here. + config SECCOMP_FILTER def_bool y depends on HAVE_ARCH_SECCOMP_FILTER && SECCOMP && NET diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index e00d94b16658..e26c19a16284 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -67,6 +67,7 @@ config ARM select HAVE_ARCH_JUMP_LABEL if !XIP_KERNEL && !CPU_ENDIAN_BE32 && MMU select HAVE_ARCH_KGDB if !CPU_ENDIAN_BE32 && MMU select HAVE_ARCH_MMAP_RND_BITS if MMU + select HAVE_ARCH_SECCOMP select HAVE_ARCH_SECCOMP_FILTER if AEABI && !OABI_COMPAT select HAVE_ARCH_THREAD_STRUCT_WHITELIST select HAVE_ARCH_TRACEHOOK @@ -1617,20 +1618,6 @@ config UACCESS_WITH_MEMCPY However, if the CPU data cache is using a write-allocate mode, this option is unlikely to provide any performance gain. -config SECCOMP - bool - prompt "Enable seccomp to safely compute untrusted bytecode" - help - This kernel feature is useful for number crunching applications - that may need to compute untrusted bytecode during their - execution. By using pipes or other transports made available to - the process as file descriptors supporting the read/write - syscalls, it's possible to isolate those applications in - their own address space using seccomp. Once seccomp is - enabled via prctl(PR_SET_SECCOMP), it cannot be disabled - and the task is only allowed to execute a few safe syscalls - defined by each seccomp mode. 
- config PARAVIRT bool "Enable paravirtualization code" help diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 6d232837cbee..98c4e34cbec1 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -1033,19 +1033,6 @@ config ARCH_ENABLE_SPLIT_PMD_PTLOCK config CC_HAVE_SHADOW_CALL_STACK def_bool $(cc-option, -fsanitize=shadow-call-stack -ffixed-x18) -config SECCOMP - bool "Enable seccomp to safely compute untrusted bytecode" - help - This kernel feature is useful for number crunching applications - that may need to compute untrusted bytecode during their - execution. By using pipes or other transports made available to - the process as file descriptors supporting the read/write - syscalls, it's possible to isolate those applications in - their own address space using seccomp. Once seccomp is - enabled via prctl(PR_SET_SECCOMP), it cannot be disabled - and the task is only allowed to execute a few safe syscalls - defined by each seccomp mode. - config PARAVIRT bool "Enable paravirtualization code" help diff --git a/arch/csky/Kconfig b/arch/csky/Kconfig index 3d5afb5f5685..7f424c85772c 100644 --- a/arch/csky/Kconfig +++ b/arch/csky/Kconfig @@ -309,16 +309,3 @@ endmenu source "arch/csky/Kconfig.platforms" source "kernel/Kconfig.hz" - -config SECCOMP - bool "Enable seccomp to safely compute untrusted bytecode" - help - This kernel feature is useful for number crunching applications - that may need to compute untrusted bytecode during their - execution. By using pipes or other transports made available to - the process as file descriptors supporting the read/write - syscalls, it's possible to isolate those applications in - their own address space using seccomp. Once seccomp is - enabled via prctl(PR_SET_SECCOMP), it cannot be disabled - and the task is only allowed to execute a few safe syscalls - defined by each seccomp mode. 
diff --git a/arch/microblaze/Kconfig b/arch/microblaze/Kconfig index d262ac0c8714..37bd6a5f38fb 100644 --- a/arch/microblaze/Kconfig +++ b/arch/microblaze/Kconfig @@ -26,6 +26,7 @@ config MICROBLAZE select GENERIC_SCHED_CLOCK select HAVE_ARCH_HASH select HAVE_ARCH_KGDB + select HAVE_ARCH_SECCOMP select HAVE_DEBUG_KMEMLEAK select HAVE_DMA_CONTIGUOUS select HAVE_DYNAMIC_FTRACE @@ -120,23 +121,6 @@ config CMDLINE_FORCE Set this to have arguments from the default kernel command string override those passed by the boot loader. -config SECCOMP - bool "Enable seccomp to safely compute untrusted bytecode" - depends on PROC_FS - default y - help - This kernel feature is useful for number crunching applications - that may need to compute untrusted bytecode during their - execution. By using pipes or other transports made available to - the process as file descriptors supporting the read/write - syscalls, it's possible to isolate those applications in - their own address space using seccomp. Once seccomp is - enabled via /proc/<pid>/seccomp, it cannot be disabled - and the task is only allowed to execute a few safe syscalls - defined by each seccomp mode. - - If unsure, say Y. Only embedded should say N here. - endmenu menu "Kernel features" diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig index c95fa3a2484c..5f88a8fc11fc 100644 --- a/arch/mips/Kconfig +++ b/arch/mips/Kconfig @@ -3004,23 +3004,6 @@ config PHYSICAL_START specified in the "crashkernel=YM@XM" command line boot parameter passed to the panic-ed kernel). -config SECCOMP - bool "Enable seccomp to safely compute untrusted bytecode" - depends on PROC_FS - default y - help - This kernel feature is useful for number crunching applications - that may need to compute untrusted bytecode during their - execution. By using pipes or other transports made available to - the process as file descriptors supporting the read/write - syscalls, it's possible to isolate those applications in - their own address space using seccomp. 
Once seccomp is - enabled via /proc/<pid>/seccomp, it cannot be disabled - and the task is only allowed to execute a few safe syscalls - defined by each seccomp mode. - - If unsure, say Y. Only embedded should say N here. - config MIPS_O32_FP64_SUPPORT bool "Support for O32 binaries using 64-bit FP" if !CPU_MIPSR6 depends on 32BIT || MIPS32_O32 diff --git a/arch/parisc/Kconfig b/arch/parisc/Kconfig index 3b0f53dd70bc..cd4afe1e7a6c 100644 --- a/arch/parisc/Kconfig +++ b/arch/parisc/Kconfig @@ -378,19 +378,3 @@ endmenu source "drivers/parisc/Kconfig" - -config SECCOMP - def_bool y - prompt "Enable seccomp to safely compute untrusted bytecode" - help - This kernel feature is useful for number crunching applications - that may need to compute untrusted bytecode during their - execution. By using pipes or other transports made available to - the process as file descriptors supporting the read/write - syscalls, it's possible to isolate those applications in - their own address space using seccomp. Once seccomp is - enabled via prctl(PR_SET_SECCOMP), it cannot be disabled - and the task is only allowed to execute a few safe syscalls - defined by each seccomp mode. - - If unsure, say Y. Only embedded should say N here. diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 1f48bbfb3ce9..136fe860caef 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -934,23 +934,6 @@ config ARCH_WANTS_FREEZER_CONTROL source "kernel/power/Kconfig" -config SECCOMP - bool "Enable seccomp to safely compute untrusted bytecode" - depends on PROC_FS - default y - help - This kernel feature is useful for number crunching applications - that may need to compute untrusted bytecode during their - execution. By using pipes or other transports made available to - the process as file descriptors supporting the read/write - syscalls, it's possible to isolate those applications in - their own address space using seccomp. 
Once seccomp is - enabled via /proc/<pid>/seccomp, it cannot be disabled - and the task is only allowed to execute a few safe syscalls - defined by each seccomp mode. - - If unsure, say Y. Only embedded should say N here. - config PPC_MEM_KEYS prompt "PowerPC Memory Protection Keys" def_bool y diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig index df18372861d8..c456b558fab9 100644 --- a/arch/riscv/Kconfig +++ b/arch/riscv/Kconfig @@ -333,19 +333,6 @@ menu "Kernel features" source "kernel/Kconfig.hz" -config SECCOMP - bool "Enable seccomp to safely compute untrusted bytecode" - help - This kernel feature is useful for number crunching applications - that may need to compute untrusted bytecode during their - execution. By using pipes or other transports made available to - the process as file descriptors supporting the read/write - syscalls, it's possible to isolate those applications in - their own address space using seccomp. Once seccomp is - enabled via prctl(PR_SET_SECCOMP), it cannot be disabled - and the task is only allowed to execute a few safe syscalls - defined by each seccomp mode. - config RISCV_SBI_V01 bool "SBI v0.1 support" default y diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig index 3d86e12e8e3c..7f7b40ec699e 100644 --- a/arch/s390/Kconfig +++ b/arch/s390/Kconfig @@ -791,23 +791,6 @@ config CRASH_DUMP endmenu -config SECCOMP - def_bool y - prompt "Enable seccomp to safely compute untrusted bytecode" - depends on PROC_FS - help - This kernel feature is useful for number crunching applications - that may need to compute untrusted bytecode during their - execution. By using pipes or other transports made available to - the process as file descriptors supporting the read/write - syscalls, it's possible to isolate those applications in - their own address space using seccomp. Once seccomp is - enabled via /proc/<pid>/seccomp, it cannot be disabled - and the task is only allowed to execute a few safe syscalls - defined by each seccomp mode. 
- - If unsure, say Y. - config CCW def_bool y diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig index d20927128fce..18278152c91c 100644 --- a/arch/sh/Kconfig +++ b/arch/sh/Kconfig @@ -600,22 +600,6 @@ config PHYSICAL_START where the fail safe kernel needs to run at a different address than the panic-ed kernel. -config SECCOMP - bool "Enable seccomp to safely compute untrusted bytecode" - depends on PROC_FS - help - This kernel feature is useful for number crunching applications - that may need to compute untrusted bytecode during their - execution. By using pipes or other transports made available to - the process as file descriptors supporting the read/write - syscalls, it's possible to isolate those applications in - their own address space using seccomp. Once seccomp is - enabled via prctl, it cannot be disabled and the task is only - allowed to execute a few safe syscalls defined by each seccomp - mode. - - If unsure, say N. - config SMP bool "Symmetric multi-processing support" depends on SYS_SUPPORTS_SMP diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig index efeff2c896a5..d62ce83cf009 100644 --- a/arch/sparc/Kconfig +++ b/arch/sparc/Kconfig @@ -23,6 +23,7 @@ config SPARC select HAVE_OPROFILE select HAVE_ARCH_KGDB if !SMP || SPARC64 select HAVE_ARCH_TRACEHOOK + select HAVE_ARCH_SECCOMP if SPARC64 select HAVE_EXIT_THREAD select HAVE_PCI select SYSCTL_EXCEPTION_TRACE @@ -226,23 +227,6 @@ config EARLYFB help Say Y here to enable a faster early framebuffer boot console. -config SECCOMP - bool "Enable seccomp to safely compute untrusted bytecode" - depends on SPARC64 && PROC_FS - default y - help - This kernel feature is useful for number crunching applications - that may need to compute untrusted bytecode during their - execution. By using pipes or other transports made available to - the process as file descriptors supporting the read/write - syscalls, it's possible to isolate those applications in - their own address space using seccomp. 
Once seccomp is - enabled via /proc/<pid>/seccomp, it cannot be disabled - and the task is only allowed to execute a few safe syscalls - defined by each seccomp mode. - - If unsure, say Y. Only embedded should say N here. - config HOTPLUG_CPU bool "Support for hot-pluggable CPUs" depends on SPARC64 && SMP diff --git a/arch/um/Kconfig b/arch/um/Kconfig index eb51fec75948..d49f471b02e3 100644 --- a/arch/um/Kconfig +++ b/arch/um/Kconfig @@ -173,22 +173,6 @@ config PGTABLE_LEVELS default 3 if 3_LEVEL_PGTABLES default 2 -config SECCOMP - def_bool y - prompt "Enable seccomp to safely compute untrusted bytecode" - help - This kernel feature is useful for number crunching applications - that may need to compute untrusted bytecode during their - execution. By using pipes or other transports made available to - the process as file descriptors supporting the read/write - syscalls, it's possible to isolate those applications in - their own address space using seccomp. Once seccomp is - enabled via prctl(PR_SET_SECCOMP), it cannot be disabled - and the task is only allowed to execute a few safe syscalls - defined by each seccomp mode. - - If unsure, say Y. - config UML_TIME_TRAVEL_SUPPORT bool prompt "Support time-travel mode (e.g. for test execution)" diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 7101ac64bb20..1ab22869a765 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -1968,22 +1968,6 @@ config EFI_MIXED If unsure, say N. -config SECCOMP - def_bool y - prompt "Enable seccomp to safely compute untrusted bytecode" - help - This kernel feature is useful for number crunching applications - that may need to compute untrusted bytecode during their - execution. By using pipes or other transports made available to - the process as file descriptors supporting the read/write - syscalls, it's possible to isolate those applications in - their own address space using seccomp. 
Once seccomp is - enabled via prctl(PR_SET_SECCOMP), it cannot be disabled - and the task is only allowed to execute a few safe syscalls - defined by each seccomp mode. - - If unsure, say Y. Only embedded should say N here. - source "kernel/Kconfig.hz" config KEXEC diff --git a/arch/xtensa/Kconfig b/arch/xtensa/Kconfig index e997e0119c02..d8a29dc5a284 100644 --- a/arch/xtensa/Kconfig +++ b/arch/xtensa/Kconfig @@ -217,20 +217,6 @@ config HOTPLUG_CPU Say N if you want to disable CPU hotplug. -config SECCOMP - bool - prompt "Enable seccomp to safely compute untrusted bytecode" - help - This kernel feature is useful for number crunching applications - that may need to compute untrusted bytecode during their - execution. By using pipes or other transports made available to - the process as file descriptors supporting the read/write - syscalls, it's possible to isolate those applications in - their own address space using seccomp. Once seccomp is - enabled via prctl(PR_SET_SECCOMP), it cannot be disabled - and the task is only allowed to execute a few safe syscalls - defined by each seccomp mode. - config FAST_SYSCALL_XTENSA bool "Enable fast atomic syscalls" default n -- 2.28.0 ^ permalink raw reply related [flat|nested] 149+ messages in thread
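The per-arch SECCOMP blocks deleted above get replaced by a single shared definition, with each architecture only selecting HAVE_ARCH_SECCOMP (as the sparc hunk's `select HAVE_ARCH_SECCOMP if SPARC64` shows). A hedged sketch of the consolidated shape in arch/Kconfig, with the help text abbreviated rather than quoted from the patch:

```kconfig
# Sketch only: the consolidated option, replacing the per-arch copies.
config HAVE_ARCH_SECCOMP
	bool

config SECCOMP
	prompt "Enable seccomp to safely execute untrusted bytecode"
	def_bool y
	depends on HAVE_ARCH_SECCOMP
```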
* [PATCH seccomp 2/6] asm/syscall.h: Add syscall_arches[] array 2020-09-24 12:06 ` [PATCH seccomp 0/6] " YiFei Zhu 2020-09-24 12:06 ` [PATCH seccomp 1/6] seccomp: Move config option SECCOMP to arch/Kconfig YiFei Zhu @ 2020-09-24 12:06 ` YiFei Zhu 2020-09-24 12:06 ` [PATCH seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent YiFei Zhu ` (4 subsequent siblings) 6 siblings, 0 replies; 149+ messages in thread From: YiFei Zhu @ 2020-09-24 12:06 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: YiFei Zhu <yifeifz2@illinois.edu> The seccomp cache emulator needs to know all the architecture numbers that syscall_get_arch() could return for the kernel build, in order to generate a cache for each of them. The array is declared in the header as static __maybe_unused const to maximize compiler optimization opportunities such as loop unrolling. 
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> --- arch/alpha/include/asm/syscall.h | 4 ++++ arch/arc/include/asm/syscall.h | 24 +++++++++++++++++++----- arch/arm/include/asm/syscall.h | 4 ++++ arch/arm64/include/asm/syscall.h | 4 ++++ arch/c6x/include/asm/syscall.h | 13 +++++++++++-- arch/csky/include/asm/syscall.h | 4 ++++ arch/h8300/include/asm/syscall.h | 4 ++++ arch/hexagon/include/asm/syscall.h | 4 ++++ arch/ia64/include/asm/syscall.h | 4 ++++ arch/m68k/include/asm/syscall.h | 4 ++++ arch/microblaze/include/asm/syscall.h | 4 ++++ arch/mips/include/asm/syscall.h | 16 ++++++++++++++++ arch/nds32/include/asm/syscall.h | 13 +++++++++++-- arch/nios2/include/asm/syscall.h | 4 ++++ arch/openrisc/include/asm/syscall.h | 4 ++++ arch/parisc/include/asm/syscall.h | 7 +++++++ arch/powerpc/include/asm/syscall.h | 14 ++++++++++++++ arch/riscv/include/asm/syscall.h | 14 ++++++++++---- arch/s390/include/asm/syscall.h | 7 +++++++ arch/sh/include/asm/syscall_32.h | 17 +++++++++++------ arch/sparc/include/asm/syscall.h | 9 +++++++++ arch/x86/include/asm/syscall.h | 11 +++++++++++ arch/x86/um/asm/syscall.h | 14 ++++++++++---- arch/xtensa/include/asm/syscall.h | 4 ++++ 24 files changed, 184 insertions(+), 23 deletions(-) diff --git a/arch/alpha/include/asm/syscall.h b/arch/alpha/include/asm/syscall.h index 11c688c1d7ec..625ac9b23f37 100644 --- a/arch/alpha/include/asm/syscall.h +++ b/arch/alpha/include/asm/syscall.h @@ -4,6 +4,10 @@ #include <uapi/linux/audit.h> +static __maybe_unused const int syscall_arches[] = { + AUDIT_ARCH_ALPHA +}; + static inline int syscall_get_arch(struct task_struct *task) { return AUDIT_ARCH_ALPHA; diff --git a/arch/arc/include/asm/syscall.h b/arch/arc/include/asm/syscall.h index 94529e89dff0..899c13cbf5cc 100644 --- a/arch/arc/include/asm/syscall.h +++ b/arch/arc/include/asm/syscall.h @@ -65,14 +65,28 @@ syscall_get_arguments(struct task_struct *task, struct pt_regs *regs, } } +#ifdef CONFIG_ISA_ARCOMPACT +# ifdef CONFIG_CPU_BIG_ENDIAN +# define 
SYSCALL_ARCH AUDIT_ARCH_ARCOMPACTBE +# else +# define SYSCALL_ARCH AUDIT_ARCH_ARCOMPACT +# endif /* CONFIG_CPU_BIG_ENDIAN */ +#else +# ifdef CONFIG_CPU_BIG_ENDIAN +# define SYSCALL_ARCH AUDIT_ARCH_ARCV2BE +# else +# define SYSCALL_ARCH AUDIT_ARCH_ARCV2 +# endif /* CONFIG_CPU_BIG_ENDIAN */ +#endif /* CONFIG_ISA_ARCOMPACT */ + +static __maybe_unused const int syscall_arches[] = { + SYSCALL_ARCH +}; + static inline int syscall_get_arch(struct task_struct *task) { - return IS_ENABLED(CONFIG_ISA_ARCOMPACT) - ? (IS_ENABLED(CONFIG_CPU_BIG_ENDIAN) - ? AUDIT_ARCH_ARCOMPACTBE : AUDIT_ARCH_ARCOMPACT) - : (IS_ENABLED(CONFIG_CPU_BIG_ENDIAN) - ? AUDIT_ARCH_ARCV2BE : AUDIT_ARCH_ARCV2); + return SYSCALL_ARCH; } #endif diff --git a/arch/arm/include/asm/syscall.h b/arch/arm/include/asm/syscall.h index fd02761ba06c..33ade26e3956 100644 --- a/arch/arm/include/asm/syscall.h +++ b/arch/arm/include/asm/syscall.h @@ -73,6 +73,10 @@ static inline void syscall_set_arguments(struct task_struct *task, memcpy(®s->ARM_r0 + 1, args, 5 * sizeof(args[0])); } +static __maybe_unused const int syscall_arches[] = { + AUDIT_ARCH_ARM +}; + static inline int syscall_get_arch(struct task_struct *task) { /* ARM tasks don't change audit architectures on the fly. */ diff --git a/arch/arm64/include/asm/syscall.h b/arch/arm64/include/asm/syscall.h index cfc0672013f6..77f3d300e7a0 100644 --- a/arch/arm64/include/asm/syscall.h +++ b/arch/arm64/include/asm/syscall.h @@ -82,6 +82,10 @@ static inline void syscall_set_arguments(struct task_struct *task, memcpy(®s->regs[1], args, 5 * sizeof(args[0])); } +static __maybe_unused const int syscall_arches[] = { + AUDIT_ARCH_ARM, AUDIT_ARCH_AARCH64 +}; + /* * We don't care about endianness (__AUDIT_ARCH_LE bit) here because * AArch64 has the same system calls both on little- and big- endian. 
diff --git a/arch/c6x/include/asm/syscall.h b/arch/c6x/include/asm/syscall.h index 38f3e2284ecd..0d78c67ee1fc 100644 --- a/arch/c6x/include/asm/syscall.h +++ b/arch/c6x/include/asm/syscall.h @@ -66,10 +66,19 @@ static inline void syscall_set_arguments(struct task_struct *task, regs->a9 = *args; } +#ifdef CONFIG_CPU_BIG_ENDIAN +#define SYSCALL_ARCH AUDIT_ARCH_C6XBE +#else +#define SYSCALL_ARCH AUDIT_ARCH_C6X +#endif + +static __maybe_unused const int syscall_arches[] = { + SYSCALL_ARCH +}; + static inline int syscall_get_arch(struct task_struct *task) { - return IS_ENABLED(CONFIG_CPU_BIG_ENDIAN) - ? AUDIT_ARCH_C6XBE : AUDIT_ARCH_C6X; + return SYSCALL_ARCH; } #endif /* __ASM_C6X_SYSCALLS_H */ diff --git a/arch/csky/include/asm/syscall.h b/arch/csky/include/asm/syscall.h index f624fa3bbc22..86242d2850d7 100644 --- a/arch/csky/include/asm/syscall.h +++ b/arch/csky/include/asm/syscall.h @@ -68,6 +68,10 @@ syscall_set_arguments(struct task_struct *task, struct pt_regs *regs, memcpy(®s->a1, args, 5 * sizeof(regs->a1)); } +static __maybe_unused const int syscall_arches[] = { + AUDIT_ARCH_CSKY +}; + static inline int syscall_get_arch(struct task_struct *task) { diff --git a/arch/h8300/include/asm/syscall.h b/arch/h8300/include/asm/syscall.h index 01666b8bb263..775f6ac8fde3 100644 --- a/arch/h8300/include/asm/syscall.h +++ b/arch/h8300/include/asm/syscall.h @@ -28,6 +28,10 @@ syscall_get_arguments(struct task_struct *task, struct pt_regs *regs, *args = regs->er6; } +static __maybe_unused const int syscall_arches[] = { + AUDIT_ARCH_H8300 +}; + static inline int syscall_get_arch(struct task_struct *task) { diff --git a/arch/hexagon/include/asm/syscall.h b/arch/hexagon/include/asm/syscall.h index f6e454f18038..6ee21a76f6a3 100644 --- a/arch/hexagon/include/asm/syscall.h +++ b/arch/hexagon/include/asm/syscall.h @@ -45,6 +45,10 @@ static inline long syscall_get_return_value(struct task_struct *task, return regs->r00; } +static __maybe_unused const int syscall_arches[] = { + 
AUDIT_ARCH_HEXAGON +}; + static inline int syscall_get_arch(struct task_struct *task) { return AUDIT_ARCH_HEXAGON; diff --git a/arch/ia64/include/asm/syscall.h b/arch/ia64/include/asm/syscall.h index 6c6f16e409a8..19456125c89a 100644 --- a/arch/ia64/include/asm/syscall.h +++ b/arch/ia64/include/asm/syscall.h @@ -71,6 +71,10 @@ static inline void syscall_set_arguments(struct task_struct *task, ia64_syscall_get_set_arguments(task, regs, args, 1); } +static __maybe_unused const int syscall_arches[] = { + AUDIT_ARCH_IA64 +}; + static inline int syscall_get_arch(struct task_struct *task) { return AUDIT_ARCH_IA64; diff --git a/arch/m68k/include/asm/syscall.h b/arch/m68k/include/asm/syscall.h index 465ac039be09..031b051f9026 100644 --- a/arch/m68k/include/asm/syscall.h +++ b/arch/m68k/include/asm/syscall.h @@ -4,6 +4,10 @@ #include <uapi/linux/audit.h> +static __maybe_unused const int syscall_arches[] = { + AUDIT_ARCH_M68K +}; + static inline int syscall_get_arch(struct task_struct *task) { return AUDIT_ARCH_M68K; diff --git a/arch/microblaze/include/asm/syscall.h b/arch/microblaze/include/asm/syscall.h index 3a6924f3cbde..28cde14056d1 100644 --- a/arch/microblaze/include/asm/syscall.h +++ b/arch/microblaze/include/asm/syscall.h @@ -105,6 +105,10 @@ static inline void syscall_set_arguments(struct task_struct *task, asmlinkage unsigned long do_syscall_trace_enter(struct pt_regs *regs); asmlinkage void do_syscall_trace_leave(struct pt_regs *regs); +static __maybe_unused const int syscall_arches[] = { + AUDIT_ARCH_MICROBLAZE +}; + static inline int syscall_get_arch(struct task_struct *task) { return AUDIT_ARCH_MICROBLAZE; diff --git a/arch/mips/include/asm/syscall.h b/arch/mips/include/asm/syscall.h index 25fa651c937d..29e4c1c47c54 100644 --- a/arch/mips/include/asm/syscall.h +++ b/arch/mips/include/asm/syscall.h @@ -140,6 +140,22 @@ extern const unsigned long sys_call_table[]; extern const unsigned long sys32_call_table[]; extern const unsigned long sysn32_call_table[]; 
+static __maybe_unused const int syscall_arches[] = { +#ifdef __LITTLE_ENDIAN + AUDIT_ARCH_MIPSEL, +# ifdef CONFIG_64BIT + AUDIT_ARCH_MIPSEL64, + AUDIT_ARCH_MIPSEL64N32, +# endif /* CONFIG_64BIT */ +#else + AUDIT_ARCH_MIPS, +# ifdef CONFIG_64BIT + AUDIT_ARCH_MIPS64, + AUDIT_ARCH_MIPS64N32, +# endif /* CONFIG_64BIT */ +#endif /* __LITTLE_ENDIAN */ +}; + static inline int syscall_get_arch(struct task_struct *task) { int arch = AUDIT_ARCH_MIPS; diff --git a/arch/nds32/include/asm/syscall.h b/arch/nds32/include/asm/syscall.h index 7b5180d78e20..2dd5e33bcfcb 100644 --- a/arch/nds32/include/asm/syscall.h +++ b/arch/nds32/include/asm/syscall.h @@ -154,11 +154,20 @@ syscall_set_arguments(struct task_struct *task, struct pt_regs *regs, memcpy(®s->uregs[0] + 1, args, 5 * sizeof(args[0])); } +#ifdef CONFIG_CPU_BIG_ENDIAN +#define SYSCALL_ARCH AUDIT_ARCH_NDS32BE +#else +#define SYSCALL_ARCH AUDIT_ARCH_NDS32 +#endif + +static __maybe_unused const int syscall_arches[] = { + SYSCALL_ARCH +}; + static inline int syscall_get_arch(struct task_struct *task) { - return IS_ENABLED(CONFIG_CPU_BIG_ENDIAN) - ? 
AUDIT_ARCH_NDS32BE : AUDIT_ARCH_NDS32; + return SYSCALL_ARCH; } #endif /* _ASM_NDS32_SYSCALL_H */ diff --git a/arch/nios2/include/asm/syscall.h b/arch/nios2/include/asm/syscall.h index 526449edd768..8fa2716cac5a 100644 --- a/arch/nios2/include/asm/syscall.h +++ b/arch/nios2/include/asm/syscall.h @@ -69,6 +69,10 @@ static inline void syscall_set_arguments(struct task_struct *task, regs->r9 = *args; } +static __maybe_unused const int syscall_arches[] = { + AUDIT_ARCH_NIOS2 +}; + static inline int syscall_get_arch(struct task_struct *task) { return AUDIT_ARCH_NIOS2; diff --git a/arch/openrisc/include/asm/syscall.h b/arch/openrisc/include/asm/syscall.h index e6383be2a195..4eb28ad08042 100644 --- a/arch/openrisc/include/asm/syscall.h +++ b/arch/openrisc/include/asm/syscall.h @@ -64,6 +64,10 @@ syscall_set_arguments(struct task_struct *task, struct pt_regs *regs, memcpy(®s->gpr[3], args, 6 * sizeof(args[0])); } +static __maybe_unused const int syscall_arches[] = { + AUDIT_ARCH_OPENRISC +}; + static inline int syscall_get_arch(struct task_struct *task) { return AUDIT_ARCH_OPENRISC; diff --git a/arch/parisc/include/asm/syscall.h b/arch/parisc/include/asm/syscall.h index 00b127a5e09b..2915f140c9fd 100644 --- a/arch/parisc/include/asm/syscall.h +++ b/arch/parisc/include/asm/syscall.h @@ -55,6 +55,13 @@ static inline void syscall_rollback(struct task_struct *task, /* do nothing */ } +static __maybe_unused const int syscall_arches[] = { + AUDIT_ARCH_PARISC, +#ifdef CONFIG_64BIT + AUDIT_ARCH_PARISC64, +#endif +}; + static inline int syscall_get_arch(struct task_struct *task) { int arch = AUDIT_ARCH_PARISC; diff --git a/arch/powerpc/include/asm/syscall.h b/arch/powerpc/include/asm/syscall.h index fd1b518eed17..781deb211e3d 100644 --- a/arch/powerpc/include/asm/syscall.h +++ b/arch/powerpc/include/asm/syscall.h @@ -104,6 +104,20 @@ static inline void syscall_set_arguments(struct task_struct *task, regs->orig_gpr3 = args[0]; } +static __maybe_unused const int syscall_arches[] = { 
+#ifdef __LITTLE_ENDIAN__ + AUDIT_ARCH_PPC | __AUDIT_ARCH_LE, +# ifdef CONFIG_PPC64 + AUDIT_ARCH_PPC64LE, +# endif /* CONFIG_PPC64 */ +#else + AUDIT_ARCH_PPC, +# ifdef CONFIG_PPC64 + AUDIT_ARCH_PPC64, +# endif /* CONFIG_PPC64 */ +#endif /* __LITTLE_ENDIAN__ */ +}; + static inline int syscall_get_arch(struct task_struct *task) { int arch; diff --git a/arch/riscv/include/asm/syscall.h b/arch/riscv/include/asm/syscall.h index 49350c8bd7b0..4b36d358243e 100644 --- a/arch/riscv/include/asm/syscall.h +++ b/arch/riscv/include/asm/syscall.h @@ -73,13 +73,19 @@ static inline void syscall_set_arguments(struct task_struct *task, memcpy(®s->a1, args, 5 * sizeof(regs->a1)); } -static inline int syscall_get_arch(struct task_struct *task) -{ #ifdef CONFIG_64BIT - return AUDIT_ARCH_RISCV64; +#define SYSCALL_ARCH AUDIT_ARCH_RISCV64 #else - return AUDIT_ARCH_RISCV32; +#define SYSCALL_ARCH AUDIT_ARCH_RISCV32 #endif + +static __maybe_unused const int syscall_arches[] = { + SYSCALL_ARCH +}; + +static inline int syscall_get_arch(struct task_struct *task) +{ + return SYSCALL_ARCH; } #endif /* _ASM_RISCV_SYSCALL_H */ diff --git a/arch/s390/include/asm/syscall.h b/arch/s390/include/asm/syscall.h index d9d5de0f67ff..4cb9da36610a 100644 --- a/arch/s390/include/asm/syscall.h +++ b/arch/s390/include/asm/syscall.h @@ -89,6 +89,13 @@ static inline void syscall_set_arguments(struct task_struct *task, regs->orig_gpr2 = args[0]; } +static __maybe_unused const int syscall_arches[] = { + AUDIT_ARCH_S390X, +#ifdef CONFIG_COMPAT + AUDIT_ARCH_S390, +#endif +}; + static inline int syscall_get_arch(struct task_struct *task) { #ifdef CONFIG_COMPAT diff --git a/arch/sh/include/asm/syscall_32.h b/arch/sh/include/asm/syscall_32.h index cb51a7528384..4780f2339c72 100644 --- a/arch/sh/include/asm/syscall_32.h +++ b/arch/sh/include/asm/syscall_32.h @@ -69,13 +69,18 @@ static inline void syscall_set_arguments(struct task_struct *task, regs->regs[4] = args[0]; } -static inline int syscall_get_arch(struct 
task_struct *task) -{ - int arch = AUDIT_ARCH_SH; - #ifdef CONFIG_CPU_LITTLE_ENDIAN - arch |= __AUDIT_ARCH_LE; +#define SYSCALL_ARCH AUDIT_ARCH_SHEL +#else +#define SYSCALL_ARCH AUDIT_ARCH_SH #endif - return arch; + +static __maybe_unused const int syscall_arches[] = { + SYSCALL_ARCH +}; + +static inline int syscall_get_arch(struct task_struct *task) +{ + return SYSCALL_ARCH; } #endif /* __ASM_SH_SYSCALL_32_H */ diff --git a/arch/sparc/include/asm/syscall.h b/arch/sparc/include/asm/syscall.h index 62a5a78804c4..a458992cdcfe 100644 --- a/arch/sparc/include/asm/syscall.h +++ b/arch/sparc/include/asm/syscall.h @@ -127,6 +127,15 @@ static inline void syscall_set_arguments(struct task_struct *task, regs->u_regs[UREG_I0 + i] = args[i]; } +static __maybe_unused const int syscall_arches[] = { +#ifdef CONFIG_SPARC64 + AUDIT_ARCH_SPARC64, +#endif +#if !defined(CONFIG_SPARC64) || defined(CONFIG_COMPAT) + AUDIT_ARCH_SPARC, +#endif +}; + static inline int syscall_get_arch(struct task_struct *task) { #if defined(CONFIG_SPARC64) && defined(CONFIG_COMPAT) diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h index 7cbf733d11af..e13bb2a65b6f 100644 --- a/arch/x86/include/asm/syscall.h +++ b/arch/x86/include/asm/syscall.h @@ -97,6 +97,10 @@ static inline void syscall_set_arguments(struct task_struct *task, memcpy(®s->bx + i, args, n * sizeof(args[0])); } +static __maybe_unused const int syscall_arches[] = { + AUDIT_ARCH_I386 +}; + static inline int syscall_get_arch(struct task_struct *task) { return AUDIT_ARCH_I386; @@ -152,6 +156,13 @@ static inline void syscall_set_arguments(struct task_struct *task, } } +static __maybe_unused const int syscall_arches[] = { + AUDIT_ARCH_X86_64, +#ifdef CONFIG_IA32_EMULATION + AUDIT_ARCH_I386, +#endif +}; + static inline int syscall_get_arch(struct task_struct *task) { /* x32 tasks should be considered AUDIT_ARCH_X86_64. 
*/ diff --git a/arch/x86/um/asm/syscall.h b/arch/x86/um/asm/syscall.h index 56a2f0913e3c..590a31e22b99 100644 --- a/arch/x86/um/asm/syscall.h +++ b/arch/x86/um/asm/syscall.h @@ -9,13 +9,19 @@ typedef asmlinkage long (*sys_call_ptr_t)(unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long); -static inline int syscall_get_arch(struct task_struct *task) -{ #ifdef CONFIG_X86_32 - return AUDIT_ARCH_I386; +#define SYSCALL_ARCH AUDIT_ARCH_I386 #else - return AUDIT_ARCH_X86_64; +#define SYSCALL_ARCH AUDIT_ARCH_X86_64 #endif + +static __maybe_unused const int syscall_arches[] = { + SYSCALL_ARCH +}; + +static inline int syscall_get_arch(struct task_struct *task) +{ + return SYSCALL_ARCH; } #endif /* __UM_ASM_SYSCALL_H */ diff --git a/arch/xtensa/include/asm/syscall.h b/arch/xtensa/include/asm/syscall.h index f9a671cbf933..3d334fb0d329 100644 --- a/arch/xtensa/include/asm/syscall.h +++ b/arch/xtensa/include/asm/syscall.h @@ -14,6 +14,10 @@ #include <asm/ptrace.h> #include <uapi/linux/audit.h> +static __maybe_unused const int syscall_arches[] = { + AUDIT_ARCH_XTENSA +}; + static inline int syscall_get_arch(struct task_struct *task) { return AUDIT_ARCH_XTENSA; -- 2.28.0 ^ permalink raw reply related [flat|nested] 149+ messages in thread
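The pattern each architecture header adopts above can be sketched in self-contained userspace C. The DEMO_* names and values below are illustrative stand-ins (not the kernel's AUDIT_ARCH_* definitions); the invariant being demonstrated is that every value syscall_get_arch() may return must also appear in syscall_arches[]:

```c
#include <assert.h>

/* Stand-ins for AUDIT_ARCH_* values; the real constants live in
 * uapi/linux/audit.h. */
#define DEMO_ARCH_LE 0x40000028u	/* little-endian variant */
#define DEMO_ARCH_BE 0x00000028u	/* big-endian variant */

/* Compile-time selection, mirroring the CONFIG_CPU_BIG_ENDIAN #ifdefs
 * the patch adds to arc, c6x, nds32, sh, and others. */
#ifdef DEMO_BIG_ENDIAN
# define DEMO_SYSCALL_ARCH DEMO_ARCH_BE
#else
# define DEMO_SYSCALL_ARCH DEMO_ARCH_LE
#endif

/* The seccomp cache walks this array; because it is a small file-scope
 * constant, the compiler can unroll loops over it. */
static const unsigned int demo_syscall_arches[] = {
	DEMO_SYSCALL_ARCH,
};

static unsigned int demo_syscall_get_arch(void)
{
	return DEMO_SYSCALL_ARCH;
}
```

For multi-ABI builds (MIPS, PowerPC, s390, sparc, x86), the array simply grows extra entries under the relevant CONFIG_* guards, as the hunks above show.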
* [PATCH seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent 2020-09-24 12:06 ` [PATCH seccomp 0/6] " YiFei Zhu 2020-09-24 12:06 ` [PATCH seccomp 1/6] seccomp: Move config option SECCOMP to arch/Kconfig YiFei Zhu 2020-09-24 12:06 ` [PATCH seccomp 2/6] asm/syscall.h: Add syscall_arches[] array YiFei Zhu @ 2020-09-24 12:06 ` YiFei Zhu 2020-09-24 12:06 ` [PATCH seccomp 4/6] seccomp/cache: Lookup syscall allowlist for fast path YiFei Zhu ` (3 subsequent siblings) 6 siblings, 0 replies; 149+ messages in thread From: YiFei Zhu @ 2020-09-24 12:06 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: YiFei Zhu <yifeifz2@illinois.edu> SECCOMP_CACHE_NR_ONLY will only operate on syscalls that do not access any syscall arguments or the instruction pointer. To facilitate this, we need a static analyser that can tell whether a filter will return allow regardless of syscall arguments for a given architecture number / syscall number pair. This is implemented here with a pseudo-emulator, and the result is stored in a per-filter bitmap. Each common BPF instruction (stolen from Kees's list [1]) is emulated. Any weirdness, or any load from a syscall argument, causes the emulator to bail. The emulation is also halted if it reaches a return; in that case, if it returns SECCOMP_RET_ALLOW, the syscall is marked as good. Filter dependency is resolved at attach time. If a filter depends on other filters, we AND its bitmap with its dependee's: if the dependee does not guarantee to allow a syscall, the depender is also marked as not guaranteeing to allow it. 
[1] https://lore.kernel.org/lkml/20200923232923.3142503-5-keescook@chromium.org/ Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> --- arch/Kconfig | 25 ++++++ kernel/seccomp.c | 196 ++++++++++++++++++++++++++++++++++++++++++++++- 2 files changed, 220 insertions(+), 1 deletion(-) diff --git a/arch/Kconfig b/arch/Kconfig index 6dfc5673215d..8cc3dc87f253 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -489,6 +489,31 @@ config SECCOMP_FILTER See Documentation/userspace-api/seccomp_filter.rst for details. +choice + prompt "Seccomp filter cache" + default SECCOMP_CACHE_NONE + depends on SECCOMP_FILTER + help + Seccomp filters can potentially incur large overhead for each + system call. This can alleviate some of the overhead. + + If in doubt, select 'syscall numbers only'. + +config SECCOMP_CACHE_NONE + bool "None" + help + No caching is done. Seccomp filters will be called each time + a system call occurs in a seccomp-guarded task. + +config SECCOMP_CACHE_NR_ONLY + bool "Syscall number only" + depends on !HAVE_SPARSE_SYSCALL_NR + help + For each syscall number, if the seccomp filter has a fixed + result, store that result in a bitmap to speed up system calls. + +endchoice + config HAVE_ARCH_STACKLEAK bool help diff --git a/kernel/seccomp.c b/kernel/seccomp.c index 3ee59ce0a323..7c286d66f983 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -143,6 +143,32 @@ struct notification { struct list_head notifications; }; +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY +/** + * struct seccomp_cache_filter_data - container for cache's per-filter data + * + * @syscall_ok: A bitmap for each architecture number, where each bit + * represents whether the filter will always allow the syscall. 
+ */ +struct seccomp_cache_filter_data { + DECLARE_BITMAP(syscall_ok[ARRAY_SIZE(syscall_arches)], NR_syscalls); +}; + +#define SECCOMP_EMU_MAX_PENDING_STATES 64 +#else +struct seccomp_cache_filter_data { }; + +static inline int seccomp_cache_prepare(struct seccomp_filter *sfilter) +{ + return 0; +} + +static inline void seccomp_cache_inherit(struct seccomp_filter *sfilter, + const struct seccomp_filter *prev) +{ +} +#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */ + /** * struct seccomp_filter - container for seccomp BPF programs * @@ -185,6 +211,7 @@ struct seccomp_filter { struct notification *notif; struct mutex notify_lock; wait_queue_head_t wqh; + struct seccomp_cache_filter_data cache; }; /* Limit any path through the tree to 256KB worth of instructions. */ @@ -530,6 +557,139 @@ static inline void seccomp_sync_threads(unsigned long flags) } } +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY +/** + * struct seccomp_emu_env - container for seccomp emulator environment + * + * @filter: The cBPF filter instructions. + * @nr: The syscall number we are emulating. + * @arch: The architecture number we are emulating. + * @syscall_ok: Emulation result, whether it is okay for seccomp to cache the + * syscall. + */ +struct seccomp_emu_env { + struct sock_filter *filter; + int arch; + int nr; + bool syscall_ok; +}; + +/** + * struct seccomp_emu_state - container for seccomp emulator state + * + * @next: The next pending state. This structure is a linked list. + * @pc: The current program counter. + * @areg: The value of the A register. + */ +struct seccomp_emu_state { + struct seccomp_emu_state *next; + int pc; + u32 areg; +}; + +/** + * seccomp_emu_step - step one instruction in the emulator + * @env: The emulator environment + * @state: The emulator state + * + * Returns 1 to halt emulation, 0 to continue, or -errno if error occurred. 
+ */ +static int seccomp_emu_step(struct seccomp_emu_env *env, + struct seccomp_emu_state *state) +{ + struct sock_filter *ftest = &env->filter[state->pc++]; + u16 code = ftest->code; + u32 k = ftest->k; + bool compare; + + switch (code) { + case BPF_LD | BPF_W | BPF_ABS: + if (k == offsetof(struct seccomp_data, nr)) + state->areg = env->nr; + else if (k == offsetof(struct seccomp_data, arch)) + state->areg = env->arch; + else + return 1; + + return 0; + case BPF_JMP | BPF_JA: + state->pc += k; + return 0; + case BPF_JMP | BPF_JEQ | BPF_K: + case BPF_JMP | BPF_JGE | BPF_K: + case BPF_JMP | BPF_JGT | BPF_K: + case BPF_JMP | BPF_JSET | BPF_K: + switch (BPF_OP(code)) { + case BPF_JEQ: + compare = state->areg == k; + break; + case BPF_JGT: + compare = state->areg > k; + break; + case BPF_JGE: + compare = state->areg >= k; + break; + case BPF_JSET: + compare = state->areg & k; + break; + default: + WARN_ON(true); + return -EINVAL; + } + + state->pc += compare ? ftest->jt : ftest->jf; + return 0; + case BPF_ALU | BPF_AND | BPF_K: + state->areg &= k; + return 0; + case BPF_RET | BPF_K: + env->syscall_ok = k == SECCOMP_RET_ALLOW; + return 1; + default: + return 1; + } +} + +/** + * seccomp_cache_prepare - emulate the filter to find cachable syscalls + * @sfilter: The seccomp filter + * + * Returns 0 if successful or -errno if error occurred. 
+ */ +int seccomp_cache_prepare(struct seccomp_filter *sfilter) +{ + struct sock_fprog_kern *fprog = sfilter->prog->orig_prog; + struct sock_filter *filter = fprog->filter; + int arch, nr, res = 0; + + for (arch = 0; arch < ARRAY_SIZE(syscall_arches); arch++) { + for (nr = 0; nr < NR_syscalls; nr++) { + struct seccomp_emu_env env = {0}; + struct seccomp_emu_state state = {0}; + + env.filter = filter; + env.arch = syscall_arches[arch]; + env.nr = nr; + + while (true) { + res = seccomp_emu_step(&env, &state); + if (res) + break; + } + + if (res < 0) + goto out; + + if (env.syscall_ok) + set_bit(nr, sfilter->cache.syscall_ok[arch]); + } + } + +out: + return res; +} +#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */ + /** * seccomp_prepare_filter: Prepares a seccomp filter for use. * @fprog: BPF program to install @@ -540,7 +700,8 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog) { struct seccomp_filter *sfilter; int ret; - const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE); + const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE) || + IS_ENABLED(CONFIG_SECCOMP_CACHE_NR_ONLY); if (fprog->len == 0 || fprog->len > BPF_MAXINSNS) return ERR_PTR(-EINVAL); @@ -571,6 +732,13 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog) return ERR_PTR(ret); } + ret = seccomp_cache_prepare(sfilter); + if (ret < 0) { + bpf_prog_destroy(sfilter->prog); + kfree(sfilter); + return ERR_PTR(ret); + } + refcount_set(&sfilter->refs, 1); refcount_set(&sfilter->users, 1); init_waitqueue_head(&sfilter->wqh); @@ -606,6 +774,31 @@ seccomp_prepare_user_filter(const char __user *user_filter) return filter; } +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY +/** + * seccomp_cache_inherit - inherit cacheable syscalls from the previous filter + * @sfilter: The new seccomp filter + * @prev: The previous filter the new one is stacked upon + * + * ANDs each per-arch bitmap with @prev's, so a syscall is only cached as + * allowed if every filter in the chain allows it. 
+ */ +static void seccomp_cache_inherit(struct seccomp_filter *sfilter, + const struct seccomp_filter *prev) +{ + int arch; + + if (!prev) + return; + + for (arch = 0; arch < ARRAY_SIZE(syscall_arches); arch++) { + bitmap_and(sfilter->cache.syscall_ok[arch], + sfilter->cache.syscall_ok[arch], + prev->cache.syscall_ok[arch], + NR_syscalls); + } +} +#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */ + /** * seccomp_attach_filter: validate and attach filter * @flags: flags to change filter behavior @@ -655,6 +848,7 @@ static long seccomp_attach_filter(unsigned int flags, * task reference. */ filter->prev = current->seccomp.filter; + seccomp_cache_inherit(filter, filter->prev); current->seccomp.filter = filter; atomic_inc(¤t->seccomp.filter_count); -- 2.28.0 ^ permalink raw reply related [flat|nested] 149+ messages in thread
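The walk the emulator performs can be illustrated with a self-contained userspace sketch. The opcode names, struct layout, and single-path loop below are simplified stand-ins for the kernel's sock_filter / seccomp_emu_env machinery, not the patch itself:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for cBPF opcodes; the kernel code switches on the
 * real BPF_LD/BPF_JMP/BPF_RET encodings from struct sock_filter. */
enum demo_code {
	OP_LD_NR,	/* A = seccomp_data.nr */
	OP_LD_ARG,	/* A = a syscall argument: not cacheable */
	OP_JEQ_K,	/* pc += (A == k) ? jt : jf */
	OP_RET_ALLOW,
	OP_RET_DENY,
};

struct demo_insn {
	enum demo_code code;
	uint32_t k;
	uint8_t jt, jf;
};

/* Returns true iff the filter provably returns ALLOW for syscall @nr
 * without reading any syscall argument; bails conservatively otherwise,
 * just as seccomp_emu_step() halts on anything it cannot model. */
static bool demo_always_allows(const struct demo_insn *filter, uint32_t nr)
{
	uint32_t areg = 0;
	int pc = 0;

	for (;;) {
		const struct demo_insn *i = &filter[pc++];

		switch (i->code) {
		case OP_LD_NR:
			areg = nr;
			break;
		case OP_LD_ARG:
			return false;	/* arg-dependent: cannot cache */
		case OP_JEQ_K:
			pc += (areg == i->k) ? i->jt : i->jf;
			break;
		case OP_RET_ALLOW:
			return true;	/* would set this nr's bit */
		case OP_RET_DENY:
			return false;
		}
	}
}
```

A filter of the shape { LD_NR; JEQ 0 ? +0 : +1; RET_ALLOW; RET_DENY } yields true only for nr 0, while any filter that loads a syscall argument is never marked cacheable.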
* [PATCH seccomp 4/6] seccomp/cache: Lookup syscall allowlist for fast path 2020-09-24 12:06 ` [PATCH seccomp 0/6] " YiFei Zhu ` (2 preceding siblings ...) 2020-09-24 12:06 ` [PATCH seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent YiFei Zhu @ 2020-09-24 12:06 ` YiFei Zhu 2020-09-24 12:06 ` [PATCH seccomp 5/6] selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu ` (2 subsequent siblings) 6 siblings, 0 replies; 149+ messages in thread From: YiFei Zhu @ 2020-09-24 12:06 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: YiFei Zhu <yifeifz2@illinois.edu> The fast (common) path for seccomp should be that the filter permits the syscall to pass through, and failing seccomp is expected to be an exceptional case; it is not expected for userspace to call a denylisted syscall over and over. This first finds the current allow bitmap by iterating through the syscall_arches[] array and comparing each entry to the arch in struct seccomp_data; this loop is expected to be unrolled. It then does a test_bit against the bitmap. If the bit is set, there is no need to run the full filter; it returns SECCOMP_RET_ALLOW immediately. 
Co-developed-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu> Signed-off-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu> Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> --- kernel/seccomp.c | 37 +++++++++++++++++++++++++++++++++++++ 1 file changed, 37 insertions(+) diff --git a/kernel/seccomp.c b/kernel/seccomp.c index 7c286d66f983..5b1bd8329e9c 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -167,6 +167,12 @@ static inline void seccomp_cache_inherit(struct seccomp_filter *sfilter, const struct seccomp_filter *prev) { } + +static inline bool seccomp_cache_check(const struct seccomp_filter *sfilter, + const struct seccomp_data *sd) +{ + return false; +} #endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */ /** @@ -321,6 +327,34 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen) return 0; } +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY +/** + * seccomp_cache_check - lookup seccomp cache + * @sfilter: The seccomp filter + * @sd: The seccomp data to lookup the cache with + * + * Returns true if the seccomp_data is cached and allowed. 
+ */ +static bool seccomp_cache_check(const struct seccomp_filter *sfilter, + const struct seccomp_data *sd) +{ + int syscall_nr = sd->nr; + int arch; + + if (unlikely(syscall_nr < 0 || syscall_nr >= NR_syscalls)) + return false; + + for (arch = 0; arch < ARRAY_SIZE(syscall_arches); arch++) { + if (likely(syscall_arches[arch] == sd->arch)) + return test_bit(syscall_nr, + sfilter->cache.syscall_ok[arch]); + } + + WARN_ON_ONCE(true); + return false; +} +#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */ + /** * seccomp_run_filters - evaluates all seccomp filters against @sd * @sd: optional seccomp data to be passed to filters @@ -343,6 +377,9 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd, if (WARN_ON(f == NULL)) return SECCOMP_RET_KILL_PROCESS; + if (seccomp_cache_check(f, sd)) + return SECCOMP_RET_ALLOW; + /* * All filters in the list are evaluated and the lowest BPF return * value always takes priority (ignoring the DATA). -- 2.28.0 ^ permalink raw reply related [flat|nested] 149+ messages in thread
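The lookup logic reads as follows in a self-contained userspace sketch, for a hypothetical build with two audit arches and 64 syscalls. The demo_* names are illustrative, though the two arch values happen to match AUDIT_ARCH_X86_64 and AUDIT_ARCH_I386:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_NR_SYSCALLS 64	/* stand-in for NR_syscalls */

/* Hypothetical two-arch build, e.g. x86_64 with IA32 emulation. */
static const uint32_t demo_arches[] = { 0xc000003eu, 0x40000003u };

/* One allow-bitmap per arch, as in struct seccomp_cache_filter_data.
 * Any miss (out-of-range nr, unknown arch, clear bit) means "run the
 * full filter", so the cache can only ever agree with the filter. */
static bool demo_cache_check(const uint64_t syscall_ok[2],
			     uint32_t arch, int nr)
{
	size_t i;

	if (nr < 0 || nr >= DEMO_NR_SYSCALLS)
		return false;		/* out of range: run the full filter */

	/* Small constant loop over demo_arches[]; a compiler can unroll it. */
	for (i = 0; i < sizeof(demo_arches) / sizeof(demo_arches[0]); i++)
		if (demo_arches[i] == arch)
			return (syscall_ok[i] >> nr) & 1;

	return false;			/* unknown arch: run the full filter */
}
```

The caller treats a true result exactly like the patch does: return SECCOMP_RET_ALLOW without evaluating any BPF instruction.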
* [PATCH seccomp 5/6] selftests/seccomp: Compare bitmap vs filter overhead 2020-09-24 12:06 ` [PATCH seccomp 0/6] " YiFei Zhu ` (3 preceding siblings ...) 2020-09-24 12:06 ` [PATCH seccomp 4/6] seccomp/cache: Lookup syscall allowlist for fast path YiFei Zhu @ 2020-09-24 12:06 ` YiFei Zhu 2020-09-24 12:06 ` [PATCH seccomp 6/6] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu 2020-09-24 12:44 ` [PATCH v2 seccomp 0/6] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls YiFei Zhu 6 siblings, 0 replies; 149+ messages in thread From: YiFei Zhu @ 2020-09-24 12:06 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: Kees Cook <keescook@chromium.org> As part of the seccomp benchmarking, include the expectations with regard to the timing behavior of the constant action bitmaps, and report inconsistencies better. Example output with constant action bitmaps on x86: $ sudo ./seccomp_benchmark 100000000 Current BPF sysctl settings: net.core.bpf_jit_enable = 1 net.core.bpf_jit_harden = 0 Benchmarking 100000000 syscalls... 
63.896255358 - 0.008504529 = 63887750829 (63.9s) getpid native: 638 ns 130.383312423 - 63.897315189 = 66485997234 (66.5s) getpid RET_ALLOW 1 filter (bitmap): 664 ns 196.789080421 - 130.384414983 = 66404665438 (66.4s) getpid RET_ALLOW 2 filters (bitmap): 664 ns 268.844643304 - 196.790234168 = 72054409136 (72.1s) getpid RET_ALLOW 3 filters (full): 720 ns 342.627472515 - 268.845799103 = 73781673412 (73.8s) getpid RET_ALLOW 4 filters (full): 737 ns Estimated total seccomp overhead for 1 bitmapped filter: 26 ns Estimated total seccomp overhead for 2 bitmapped filters: 26 ns Estimated total seccomp overhead for 3 full filters: 82 ns Estimated total seccomp overhead for 4 full filters: 99 ns Estimated seccomp entry overhead: 26 ns Estimated seccomp per-filter overhead (last 2 diff): 17 ns Estimated seccomp per-filter overhead (filters / 4): 18 ns Expectations: native ≤ 1 bitmap (638 ≤ 664): ✔️ native ≤ 1 filter (638 ≤ 720): ✔️ per-filter (last 2 diff) ≈ per-filter (filters / 4) (17 ≈ 18): ✔️ 1 bitmapped ≈ 2 bitmapped (26 ≈ 26): ✔️ entry ≈ 1 bitmapped (26 ≈ 26): ✔️ entry ≈ 2 bitmapped (26 ≈ 26): ✔️ native + entry + (per filter * 4) ≈ 4 filters total (732 ≈ 737): ✔️ Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> --- .../selftests/seccomp/seccomp_benchmark.c | 151 +++++++++++++++--- tools/testing/selftests/seccomp/settings | 2 +- 2 files changed, 130 insertions(+), 23 deletions(-) diff --git a/tools/testing/selftests/seccomp/seccomp_benchmark.c b/tools/testing/selftests/seccomp/seccomp_benchmark.c index 91f5a89cadac..fcc806585266 100644 --- a/tools/testing/selftests/seccomp/seccomp_benchmark.c +++ b/tools/testing/selftests/seccomp/seccomp_benchmark.c @@ -4,12 +4,16 @@ */ #define _GNU_SOURCE #include <assert.h> +#include <limits.h> +#include <stdbool.h> +#include <stddef.h> #include <stdio.h> #include <stdlib.h> #include <time.h> #include <unistd.h> #include <linux/filter.h> #include <linux/seccomp.h> +#include <sys/param.h> 
#include <sys/prctl.h> #include <sys/syscall.h> #include <sys/types.h> @@ -70,18 +74,74 @@ unsigned long long calibrate(void) return samples * seconds; } +bool approx(int i_one, int i_two) +{ + double one = i_one, one_bump = one * 0.01; + double two = i_two, two_bump = two * 0.01; + + one_bump = one + MAX(one_bump, 2.0); + two_bump = two + MAX(two_bump, 2.0); + + /* Equal to, or within 1% or 2 digits */ + if (one == two || + (one > two && one <= two_bump) || + (two > one && two <= one_bump)) + return true; + return false; +} + +bool le(int i_one, int i_two) +{ + if (i_one <= i_two) + return true; + return false; +} + +long compare(const char *name_one, const char *name_eval, const char *name_two, + unsigned long long one, bool (*eval)(int, int), unsigned long long two) +{ + bool good; + + printf("\t%s %s %s (%lld %s %lld): ", name_one, name_eval, name_two, + (long long)one, name_eval, (long long)two); + if (one > INT_MAX) { + printf("Miscalculation! Measurement went negative: %lld\n", (long long)one); + return 1; + } + if (two > INT_MAX) { + printf("Miscalculation! Measurement went negative: %lld\n", (long long)two); + return 1; + } + + good = eval(one, two); + printf("%s\n", good ? "✔️" : "❌"); + + return good ? 
0 : 1; +} + int main(int argc, char *argv[]) { + struct sock_filter bitmap_filter[] = { + BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, nr)), + BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW), + }; + struct sock_fprog bitmap_prog = { + .len = (unsigned short)ARRAY_SIZE(bitmap_filter), + .filter = bitmap_filter, + }; struct sock_filter filter[] = { + BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, args[0])), BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW), }; struct sock_fprog prog = { .len = (unsigned short)ARRAY_SIZE(filter), .filter = filter, }; - long ret; - unsigned long long samples; - unsigned long long native, filter1, filter2; + + long ret, bits; + unsigned long long samples, calc; + unsigned long long native, filter1, filter2, bitmap1, bitmap2; + unsigned long long entry, per_filter1, per_filter2; printf("Current BPF sysctl settings:\n"); system("sysctl net.core.bpf_jit_enable"); @@ -101,35 +161,82 @@ int main(int argc, char *argv[]) ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0); assert(ret == 0); - /* One filter */ - ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog); + /* One filter resulting in a bitmap */ + ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bitmap_prog); assert(ret == 0); - filter1 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples; - printf("getpid RET_ALLOW 1 filter: %llu ns\n", filter1); + bitmap1 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples; + printf("getpid RET_ALLOW 1 filter (bitmap): %llu ns\n", bitmap1); + + /* Second filter resulting in a bitmap */ + ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bitmap_prog); + assert(ret == 0); - if (filter1 == native) - printf("No overhead measured!? 
Try running again with more samples.\n"); + bitmap2 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples; + printf("getpid RET_ALLOW 2 filters (bitmap): %llu ns\n", bitmap2); - /* Two filters */ + /* Third filter, can no longer be converted to bitmap */ ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog); assert(ret == 0); - filter2 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples; - printf("getpid RET_ALLOW 2 filters: %llu ns\n", filter2); - - /* Calculations */ - printf("Estimated total seccomp overhead for 1 filter: %llu ns\n", - filter1 - native); + filter1 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples; + printf("getpid RET_ALLOW 3 filters (full): %llu ns\n", filter1); - printf("Estimated total seccomp overhead for 2 filters: %llu ns\n", - filter2 - native); + /* Fourth filter, can not be converted to bitmap because of filter 3 */ + ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bitmap_prog); + assert(ret == 0); - printf("Estimated seccomp per-filter overhead: %llu ns\n", - filter2 - filter1); + filter2 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples; + printf("getpid RET_ALLOW 4 filters (full): %llu ns\n", filter2); + + /* Estimations */ +#define ESTIMATE(fmt, var, what) do { \ + var = (what); \ + printf("Estimated " fmt ": %llu ns\n", var); \ + if (var > INT_MAX) \ + goto more_samples; \ + } while (0) + + ESTIMATE("total seccomp overhead for 1 bitmapped filter", calc, + bitmap1 - native); + ESTIMATE("total seccomp overhead for 2 bitmapped filters", calc, + bitmap2 - native); + ESTIMATE("total seccomp overhead for 3 full filters", calc, + filter1 - native); + ESTIMATE("total seccomp overhead for 4 full filters", calc, + filter2 - native); + ESTIMATE("seccomp entry overhead", entry, + bitmap1 - native - (bitmap2 - bitmap1)); + ESTIMATE("seccomp per-filter overhead (last 2 diff)", per_filter1, + filter2 - filter1); + ESTIMATE("seccomp per-filter overhead (filters / 4)", per_filter2, + (filter2 - native - entry) / 4); + + 
printf("Expectations:\n"); + ret |= compare("native", "≤", "1 bitmap", native, le, bitmap1); + bits = compare("native", "≤", "1 filter", native, le, filter1); + if (bits) + goto more_samples; + + ret |= compare("per-filter (last 2 diff)", "≈", "per-filter (filters / 4)", + per_filter1, approx, per_filter2); + + bits = compare("1 bitmapped", "≈", "2 bitmapped", + bitmap1 - native, approx, bitmap2 - native); + if (bits) { + printf("Skipping constant action bitmap expectations: they appear unsupported.\n"); + goto out; + } - printf("Estimated seccomp entry overhead: %llu ns\n", - filter1 - native - (filter2 - filter1)); + ret |= compare("entry", "≈", "1 bitmapped", entry, approx, bitmap1 - native); + ret |= compare("entry", "≈", "2 bitmapped", entry, approx, bitmap2 - native); + ret |= compare("native + entry + (per filter * 4)", "≈", "4 filters total", + entry + (per_filter1 * 4) + native, approx, filter2); + if (ret == 0) + goto out; +more_samples: + printf("Saw unexpected benchmark result. Try running again with more samples?\n"); +out: return 0; } diff --git a/tools/testing/selftests/seccomp/settings b/tools/testing/selftests/seccomp/settings index ba4d85f74cd6..6091b45d226b 100644 --- a/tools/testing/selftests/seccomp/settings +++ b/tools/testing/selftests/seccomp/settings @@ -1 +1 @@ -timeout=90 +timeout=120 -- 2.28.0 ^ permalink raw reply related [flat|nested] 149+ messages in thread
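The overhead estimates printed by the benchmark follow directly from the measured per-syscall times. As a sanity check, this standalone sketch (hypothetical helper name, but the same arithmetic as the selftest's ESTIMATE() lines) reproduces the numbers reported in the example output:

```c
/* Reproduce the benchmark's overhead arithmetic (all values in ns):
 * entry cost is what one bitmapped filter adds beyond the marginal cost
 * of a second one; per-filter cost is estimated two ways and compared. */
struct estimates {
	long long entry;
	long long per_filter_diff;	/* "last 2 diff" */
	long long per_filter_avg;	/* "filters / 4" */
};

static struct estimates estimate(long long native,
				 long long bitmap1, long long bitmap2,
				 long long filter1, long long filter2)
{
	struct estimates e;

	e.entry = bitmap1 - native - (bitmap2 - bitmap1);
	e.per_filter_diff = filter2 - filter1;
	e.per_filter_avg = (filter2 - native - e.entry) / 4;
	return e;
}
```

Plugging in the reported 638/664/664/720/737 ns timings yields the 26 ns entry and 17-18 ns per-filter estimates shown in the benchmark output.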
* [PATCH seccomp 6/6] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-09-24 12:06 ` [PATCH seccomp 0/6] " YiFei Zhu ` (4 preceding siblings ...) 2020-09-24 12:06 ` [PATCH seccomp 5/6] selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu @ 2020-09-24 12:06 ` YiFei Zhu 2020-09-24 12:44 ` [PATCH v2 seccomp 0/6] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls YiFei Zhu 6 siblings, 0 replies; 149+ messages in thread From: YiFei Zhu @ 2020-09-24 12:06 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: YiFei Zhu <yifeifz2@illinois.edu> Currently the kernel does not provide infrastructure to translate architecture numbers to human-readable names. Translating syscall numbers to syscall names is possible through the FTRACE_SYSCALL infrastructure, but it does not support compat syscalls. This creates a file for each PID as /proc/pid/seccomp_cache. The file is empty when no seccomp filters are loaded, or is in the format: <hex arch number> <decimal syscall number> <ALLOW | FILTER> where ALLOW means the cache is guaranteed to allow the syscall, and FILTER means the cache will pass the syscall on to the BPF filters. For the docker default profile on x86_64 it looks like: c000003e 0 ALLOW c000003e 1 ALLOW c000003e 2 ALLOW c000003e 3 ALLOW [...] c000003e 132 ALLOW c000003e 133 ALLOW c000003e 134 FILTER c000003e 135 FILTER c000003e 136 FILTER c000003e 137 ALLOW c000003e 138 ALLOW c000003e 139 FILTER c000003e 140 ALLOW c000003e 141 ALLOW [...] This file is guarded by CONFIG_PROC_SECCOMP_CACHE with a default of N, because certain users of seccomp might not want the application to know which syscalls are definitely usable.
I'm not sure if adding all the "human readable names" is worthwhile, considering it can be easily done in userspace. Suggested-by: Jann Horn <jannh@google.com> Link: https://lore.kernel.org/lkml/CAG48ez3Ofqp4crXGksLmZY6=fGrF_tWyUCg7PBkAetvbbOPeOA@mail.gmail.com/ Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> --- arch/Kconfig | 10 ++++++++++ fs/proc/base.c | 7 +++++-- include/linux/seccomp.h | 5 +++++ kernel/seccomp.c | 26 ++++++++++++++++++++++++++ 4 files changed, 46 insertions(+), 2 deletions(-) diff --git a/arch/Kconfig b/arch/Kconfig index 8cc3dc87f253..dbfd897e5dc0 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -514,6 +514,16 @@ config SECCOMP_CACHE_NR_ONLY endchoice +config PROC_SECCOMP_CACHE + bool "Show seccomp filter cache status in /proc/pid/seccomp_cache" + depends on SECCOMP_CACHE_NR_ONLY + depends on PROC_FS + help + This is enables /proc/pid/seccomp_cache interface to monitor + seccomp cache data. The file format is subject to change. + + If unsure, say N. + config HAVE_ARCH_STACKLEAK bool help diff --git a/fs/proc/base.c b/fs/proc/base.c index 617db4e0faa0..2af626f69fa1 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -2615,7 +2615,7 @@ static struct dentry *proc_pident_instantiate(struct dentry *dentry, return d_splice_alias(inode, dentry); } -static struct dentry *proc_pident_lookup(struct inode *dir, +static struct dentry *proc_pident_lookup(struct inode *dir, struct dentry *dentry, const struct pid_entry *p, const struct pid_entry *end) @@ -2815,7 +2815,7 @@ static const struct pid_entry attr_dir_stuff[] = { static int proc_attr_dir_readdir(struct file *file, struct dir_context *ctx) { - return proc_pident_readdir(file, ctx, + return proc_pident_readdir(file, ctx, attr_dir_stuff, ARRAY_SIZE(attr_dir_stuff)); } @@ -3258,6 +3258,9 @@ static const struct pid_entry tgid_base_stuff[] = { #ifdef CONFIG_PROC_PID_ARCH_STATUS ONE("arch_status", S_IRUGO, proc_pid_arch_status), #endif +#ifdef CONFIG_PROC_SECCOMP_CACHE + ONE("seccomp_cache", S_IRUSR, 
proc_pid_seccomp_cache), +#endif }; static int proc_tgid_base_readdir(struct file *file, struct dir_context *ctx) diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h index 02aef2844c38..3cedec824365 100644 --- a/include/linux/seccomp.h +++ b/include/linux/seccomp.h @@ -121,4 +121,9 @@ static inline long seccomp_get_metadata(struct task_struct *task, return -EINVAL; } #endif /* CONFIG_SECCOMP_FILTER && CONFIG_CHECKPOINT_RESTORE */ + +#ifdef CONFIG_PROC_SECCOMP_CACHE +int proc_pid_seccomp_cache(struct seq_file *m, struct pid_namespace *ns, + struct pid *pid, struct task_struct *task); +#endif #endif /* _LINUX_SECCOMP_H */ diff --git a/kernel/seccomp.c b/kernel/seccomp.c index 5b1bd8329e9c..c5697d9483ae 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -2295,3 +2295,29 @@ static int __init seccomp_sysctl_init(void) device_initcall(seccomp_sysctl_init) #endif /* CONFIG_SYSCTL */ + +#ifdef CONFIG_PROC_SECCOMP_CACHE +/* Currently CONFIG_PROC_SECCOMP_CACHE implies CONFIG_SECCOMP_CACHE_NR_ONLY */ +int proc_pid_seccomp_cache(struct seq_file *m, struct pid_namespace *ns, + struct pid *pid, struct task_struct *task) +{ + struct seccomp_filter *f = READ_ONCE(task->seccomp.filter); + int arch, nr; + + if (!f) + return 0; + + for (arch = 0; arch < ARRAY_SIZE(syscall_arches); arch++) { + for (nr = 0; nr < NR_syscalls; nr++) { + bool cached = test_bit(nr, f->cache.syscall_ok[arch]); + char *status = cached ? "ALLOW" : "FILTER"; + + seq_printf(m, "%08x %d %s\n", syscall_arches[arch], + nr, status + ); + } + } + + return 0; +} +#endif /* CONFIG_PROC_SECCOMP_CACHE */ -- 2.28.0 ^ permalink raw reply related [flat|nested] 149+ messages in thread
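A consumer of the proposed /proc/<pid>/seccomp_cache file could parse it line by line. This is a hedged sketch: the commit message notes the format is subject to change, and parse_cache_line() is a hypothetical userspace helper, not part of the patch.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Parse one line of the proposed /proc/<pid>/seccomp_cache format:
 *   "<hex arch> <decimal nr> <ALLOW|FILTER>"
 * Returns true on success; *allow is set when the verdict is ALLOW.
 * The format is explicitly subject to change, so treat this as a sketch. */
static bool parse_cache_line(const char *line, unsigned int *arch, int *nr,
			     bool *allow)
{
	char verdict[16];

	if (sscanf(line, "%x %d %15s", arch, nr, verdict) != 3)
		return false;
	if (strcmp(verdict, "ALLOW") == 0) {
		*allow = true;
		return true;
	}
	if (strcmp(verdict, "FILTER") == 0) {
		*allow = false;
		return true;
	}
	return false;
}
```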
* [PATCH v2 seccomp 0/6] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls 2020-09-24 12:06 ` [PATCH seccomp 0/6] " YiFei Zhu ` (5 preceding siblings ...) 2020-09-24 12:06 ` [PATCH seccomp 6/6] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu @ 2020-09-24 12:44 ` YiFei Zhu 2020-09-24 12:44 ` [PATCH v2 seccomp 1/6] seccomp: Move config option SECCOMP to arch/Kconfig YiFei Zhu ` (5 more replies) 6 siblings, 6 replies; 149+ messages in thread From: YiFei Zhu @ 2020-09-24 12:44 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: YiFei Zhu <yifeifz2@illinois.edu> Alternative: https://lore.kernel.org/lkml/20200923232923.3142503-1-keescook@chromium.org/T/ Major differences from the linked alternative by Kees: * No x32 special-case handling -- not worth the complexity * No caching of denylist -- not worth the complexity * No seccomp arch pinning -- I think this is an independent feature * The bitmaps are part of the filters rather than the task. * Architectures supported by default through arch number array, except for MIPS with its sparse syscall numbers. * Configurable per-build for future different cache modes. This series adds a bitmap to cache seccomp filter results if the result permits a syscall and is independent of syscall arguments. This visibly decreases seccomp overhead for most common seccomp filters with very little memory footprint. The overhead of running Seccomp filters has been part of some past discussions [1][2][3]. Oftentimes, the filters have a large number of instructions that check syscall numbers one by one and jump based on that. Some users chain BPF filters, which further enlarges the overhead.
A recent work [6] comprehensively measures the Seccomp overhead and shows that the overhead is non-negligible and has a non-trivial impact on application performance. We observed that some common filters, such as docker's [4] or systemd's [5], make most decisions based only on the syscall number, and as past discussions considered, a bitmap where each bit represents a syscall makes the most sense for these filters. In order to build this bitmap at filter attach time, each filter is emulated for every syscall (under each possible architecture), and checked for any accesses of struct seccomp_data other than the "arch" and "nr" (syscall) members. If only "arch" and "nr" are examined, and the program returns allow, then we can be sure that the filter must return allow independently of the syscall arguments. When it is concluded that an allow must occur for the given architecture and syscall pair, seccomp will immediately allow the syscall, bypassing further BPF execution. Ongoing work is to further support arguments with fast hash table lookups. We are investigating the performance of doing so [6], and how to best integrate with the existing seccomp infrastructure. Benchmarks were performed, with the results included in patch 5 and copied below: Current BPF sysctl settings: net.core.bpf_jit_enable = 1 net.core.bpf_jit_harden = 0 Benchmarking 100000000 syscalls...
63.896255358 - 0.008504529 = 63887750829 (63.9s) getpid native: 638 ns 130.383312423 - 63.897315189 = 66485997234 (66.5s) getpid RET_ALLOW 1 filter (bitmap): 664 ns 196.789080421 - 130.384414983 = 66404665438 (66.4s) getpid RET_ALLOW 2 filters (bitmap): 664 ns 268.844643304 - 196.790234168 = 72054409136 (72.1s) getpid RET_ALLOW 3 filters (full): 720 ns 342.627472515 - 268.845799103 = 73781673412 (73.8s) getpid RET_ALLOW 4 filters (full): 737 ns Estimated total seccomp overhead for 1 bitmapped filter: 26 ns Estimated total seccomp overhead for 2 bitmapped filters: 26 ns Estimated total seccomp overhead for 3 full filters: 82 ns Estimated total seccomp overhead for 4 full filters: 99 ns Estimated seccomp entry overhead: 26 ns Estimated seccomp per-filter overhead (last 2 diff): 17 ns Estimated seccomp per-filter overhead (filters / 4): 18 ns Expectations: native ≤ 1 bitmap (638 ≤ 664): ✔️ native ≤ 1 filter (638 ≤ 720): ✔️ per-filter (last 2 diff) ≈ per-filter (filters / 4) (17 ≈ 18): ✔️ 1 bitmapped ≈ 2 bitmapped (26 ≈ 26): ✔️ entry ≈ 1 bitmapped (26 ≈ 26): ✔️ entry ≈ 2 bitmapped (26 ≈ 26): ✔️ native + entry + (per filter * 4) ≈ 4 filters total (732 ≈ 737): ✔️ RFC -> v1: * Config made on by default across all arches that could support it. * Added arch numbers array and emulate filter for each arch number, and have a per-arch bitmap. * Massively simplified the emulator so it would only support the common instructions in Kees's list. * Fixed inheriting bitmap across filters (filter->prev is always NULL during prepare). * Stole the selftest from Kees. * Added a /proc/pid/seccomp_cache by Jann's suggestion. v1 -> v2: * Corrected one outdated function documentation. Patch 1 moves the SECCOMP Kcomfig option to arch/Kconfig. Patch 2 adds a syscall_arches array so the emulator can enumerate it. Patch 3 implements the emulator that finds if a filter must return allow, Patch 4 implements the test_bit against the bitmaps. 
Patch 5 updates the selftest to better show the new semantics. Patch 6 implements /proc/pid/seccomp_cache. [1] https://lore.kernel.org/linux-security-module/c22a6c3cefc2412cad00ae14c1371711@huawei.com/T/ [2] https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/T/ [3] https://github.com/seccomp/libseccomp/issues/116 [4] https://github.com/moby/moby/blob/ae0ef82b90356ac613f329a8ef5ee42ca923417d/profiles/seccomp/default.json [5] https://github.com/systemd/systemd/blob/6743a1caf4037f03dc51a1277855018e4ab61957/src/shared/seccomp-util.c#L270 [6] Draco: Architectural and Operating System Support for System Call Security https://tianyin.github.io/pub/draco.pdf, MICRO-53, Oct. 2020 Kees Cook (1): selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu (5): seccomp: Move config option SECCOMP to arch/Kconfig asm/syscall.h: Add syscall_arches[] array seccomp/cache: Add "emulator" to check if filter is arg-dependent seccomp/cache: Lookup syscall allowlist for fast path seccomp/cache: Report cache data through /proc/pid/seccomp_cache arch/Kconfig | 56 ++++ arch/alpha/include/asm/syscall.h | 4 + arch/arc/include/asm/syscall.h | 24 +- arch/arm/Kconfig | 15 +- arch/arm/include/asm/syscall.h | 4 + arch/arm64/Kconfig | 13 - arch/arm64/include/asm/syscall.h | 4 + arch/c6x/include/asm/syscall.h | 13 +- arch/csky/Kconfig | 13 - arch/csky/include/asm/syscall.h | 4 + arch/h8300/include/asm/syscall.h | 4 + arch/hexagon/include/asm/syscall.h | 4 + arch/ia64/include/asm/syscall.h | 4 + arch/m68k/include/asm/syscall.h | 4 + arch/microblaze/Kconfig | 18 +- arch/microblaze/include/asm/syscall.h | 4 + arch/mips/Kconfig | 17 -- arch/mips/include/asm/syscall.h | 16 ++ arch/nds32/include/asm/syscall.h | 13 +- arch/nios2/include/asm/syscall.h | 4 + arch/openrisc/include/asm/syscall.h | 4 + arch/parisc/Kconfig | 16 -- arch/parisc/include/asm/syscall.h | 7 + arch/powerpc/Kconfig | 17 -- arch/powerpc/include/asm/syscall.h | 14 + arch/riscv/Kconfig | 13 - arch/riscv/include/asm/syscall.h 
| 14 +- arch/s390/Kconfig | 17 -- arch/s390/include/asm/syscall.h | 7 + arch/sh/Kconfig | 16 -- arch/sh/include/asm/syscall_32.h | 17 +- arch/sparc/Kconfig | 18 +- arch/sparc/include/asm/syscall.h | 9 + arch/um/Kconfig | 16 -- arch/x86/Kconfig | 16 -- arch/x86/include/asm/syscall.h | 11 + arch/x86/um/asm/syscall.h | 14 +- arch/xtensa/Kconfig | 14 - arch/xtensa/include/asm/syscall.h | 4 + fs/proc/base.c | 7 +- include/linux/seccomp.h | 5 + kernel/seccomp.c | 257 +++++++++++++++++- .../selftests/seccomp/seccomp_benchmark.c | 151 ++++++++-- tools/testing/selftests/seccomp/settings | 2 +- 44 files changed, 639 insertions(+), 265 deletions(-) -- 2.28.0 ^ permalink raw reply [flat|nested] 149+ messages in thread
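The attach-time analysis described in the cover letter — emulate the filter for each (arch, nr) pair and cache an allow only when the verdict provably ignores the arguments — can be modeled in miniature. This toy replaces the real cBPF emulator with a C callback that self-reports whether it consulted the arguments; the names and the example policy are illustrative only, not the patch's actual emulator:

```c
#include <stdbool.h>

typedef bool (*toy_filter)(int nr, const long *args, bool *read_args);

/* Example policy: allows everything below 100 by number alone, but for
 * nr == 100 it inspects args[0], so that syscall is not cacheable. */
static bool toy_docker_like(int nr, const long *args, bool *read_args)
{
	if (nr == 100) {
		*read_args = true;
		return args[0] == 0;
	}
	return nr < 100;
}

/* Attach-time pass: record a verdict in the bitmap only when it is both
 * ALLOW and independent of the arguments; everything else stays on the
 * slow path so the cache can never loosen the policy. */
static void toy_build_cache(toy_filter f, unsigned char *allow_map,
			    int nr_syscalls)
{
	long zero_args[6] = {0};
	int nr;

	for (nr = 0; nr < nr_syscalls; nr++) {
		bool read_args = false;
		bool allowed = f(nr, zero_args, &read_args);

		if (allowed && !read_args)
			allow_map[nr / 8] |= 1u << (nr % 8);
	}
}
```

The kernel's emulator achieves the equivalent of the read_args flag by tracking loads from struct seccomp_data while walking the BPF program's control flow.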
* [PATCH v2 seccomp 1/6] seccomp: Move config option SECCOMP to arch/Kconfig 2020-09-24 12:44 ` [PATCH v2 seccomp 0/6] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls YiFei Zhu @ 2020-09-24 12:44 ` YiFei Zhu 2020-09-24 19:11 ` Kees Cook 2020-10-27 9:52 ` Geert Uytterhoeven 2020-09-24 12:44 ` [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array YiFei Zhu ` (4 subsequent siblings) 5 siblings, 2 replies; 149+ messages in thread From: YiFei Zhu @ 2020-09-24 12:44 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: YiFei Zhu <yifeifz2@illinois.edu> In order to make adding configurable features into seccomp easier, it's better to have the options in a single location, especially considering that the bulk of the seccomp code is arch-independent. A quick look also shows that many SECCOMP descriptions are outdated; they talk about /proc rather than prctl. As a result of moving the config option and keeping it default-on, architectures arm, arm64, csky, riscv, sh, and xtensa, which did not have SECCOMP on by default prior to this, now have it on by default. Architectures microblaze, mips, powerpc, s390, sh, and sparc carry an outdated dependency on PROC_FS; this dependency is removed in this change.
Suggested-by: Jann Horn <jannh@google.com> Link: https://lore.kernel.org/lkml/CAG48ez1YWz9cnp08UZgeieYRhHdqh-ch7aNwc4JRBnGyrmgfMg@mail.gmail.com/ Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> --- arch/Kconfig | 21 +++++++++++++++++++++ arch/arm/Kconfig | 15 +-------------- arch/arm64/Kconfig | 13 ------------- arch/csky/Kconfig | 13 ------------- arch/microblaze/Kconfig | 18 +----------------- arch/mips/Kconfig | 17 ----------------- arch/parisc/Kconfig | 16 ---------------- arch/powerpc/Kconfig | 17 ----------------- arch/riscv/Kconfig | 13 ------------- arch/s390/Kconfig | 17 ----------------- arch/sh/Kconfig | 16 ---------------- arch/sparc/Kconfig | 18 +----------------- arch/um/Kconfig | 16 ---------------- arch/x86/Kconfig | 16 ---------------- arch/xtensa/Kconfig | 14 -------------- 15 files changed, 24 insertions(+), 216 deletions(-) diff --git a/arch/Kconfig b/arch/Kconfig index af14a567b493..6dfc5673215d 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -444,8 +444,12 @@ config ARCH_WANT_OLD_COMPAT_IPC select ARCH_WANT_COMPAT_IPC_PARSE_VERSION bool +config HAVE_ARCH_SECCOMP + bool + config HAVE_ARCH_SECCOMP_FILTER bool + select HAVE_ARCH_SECCOMP help An arch should select this symbol if it provides all of these things: - syscall_get_arch() @@ -458,6 +462,23 @@ config HAVE_ARCH_SECCOMP_FILTER results in the system call being skipped immediately. - seccomp syscall wired up +config SECCOMP + def_bool y + depends on HAVE_ARCH_SECCOMP + prompt "Enable seccomp to safely compute untrusted bytecode" + help + This kernel feature is useful for number crunching applications + that may need to compute untrusted bytecode during their + execution. By using pipes or other transports made available to + the process as file descriptors supporting the read/write + syscalls, it's possible to isolate those applications in + their own address space using seccomp. 
Once seccomp is + enabled via prctl(PR_SET_SECCOMP), it cannot be disabled + and the task is only allowed to execute a few safe syscalls + defined by each seccomp mode. + + If unsure, say Y. Only embedded should say N here. + config SECCOMP_FILTER def_bool y depends on HAVE_ARCH_SECCOMP_FILTER && SECCOMP && NET diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index e00d94b16658..e26c19a16284 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -67,6 +67,7 @@ config ARM select HAVE_ARCH_JUMP_LABEL if !XIP_KERNEL && !CPU_ENDIAN_BE32 && MMU select HAVE_ARCH_KGDB if !CPU_ENDIAN_BE32 && MMU select HAVE_ARCH_MMAP_RND_BITS if MMU + select HAVE_ARCH_SECCOMP select HAVE_ARCH_SECCOMP_FILTER if AEABI && !OABI_COMPAT select HAVE_ARCH_THREAD_STRUCT_WHITELIST select HAVE_ARCH_TRACEHOOK @@ -1617,20 +1618,6 @@ config UACCESS_WITH_MEMCPY However, if the CPU data cache is using a write-allocate mode, this option is unlikely to provide any performance gain. -config SECCOMP - bool - prompt "Enable seccomp to safely compute untrusted bytecode" - help - This kernel feature is useful for number crunching applications - that may need to compute untrusted bytecode during their - execution. By using pipes or other transports made available to - the process as file descriptors supporting the read/write - syscalls, it's possible to isolate those applications in - their own address space using seccomp. Once seccomp is - enabled via prctl(PR_SET_SECCOMP), it cannot be disabled - and the task is only allowed to execute a few safe syscalls - defined by each seccomp mode. 
- config PARAVIRT bool "Enable paravirtualization code" help diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 6d232837cbee..98c4e34cbec1 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -1033,19 +1033,6 @@ config ARCH_ENABLE_SPLIT_PMD_PTLOCK config CC_HAVE_SHADOW_CALL_STACK def_bool $(cc-option, -fsanitize=shadow-call-stack -ffixed-x18) -config SECCOMP - bool "Enable seccomp to safely compute untrusted bytecode" - help - This kernel feature is useful for number crunching applications - that may need to compute untrusted bytecode during their - execution. By using pipes or other transports made available to - the process as file descriptors supporting the read/write - syscalls, it's possible to isolate those applications in - their own address space using seccomp. Once seccomp is - enabled via prctl(PR_SET_SECCOMP), it cannot be disabled - and the task is only allowed to execute a few safe syscalls - defined by each seccomp mode. - config PARAVIRT bool "Enable paravirtualization code" help diff --git a/arch/csky/Kconfig b/arch/csky/Kconfig index 3d5afb5f5685..7f424c85772c 100644 --- a/arch/csky/Kconfig +++ b/arch/csky/Kconfig @@ -309,16 +309,3 @@ endmenu source "arch/csky/Kconfig.platforms" source "kernel/Kconfig.hz" - -config SECCOMP - bool "Enable seccomp to safely compute untrusted bytecode" - help - This kernel feature is useful for number crunching applications - that may need to compute untrusted bytecode during their - execution. By using pipes or other transports made available to - the process as file descriptors supporting the read/write - syscalls, it's possible to isolate those applications in - their own address space using seccomp. Once seccomp is - enabled via prctl(PR_SET_SECCOMP), it cannot be disabled - and the task is only allowed to execute a few safe syscalls - defined by each seccomp mode. 
diff --git a/arch/microblaze/Kconfig b/arch/microblaze/Kconfig index d262ac0c8714..37bd6a5f38fb 100644 --- a/arch/microblaze/Kconfig +++ b/arch/microblaze/Kconfig @@ -26,6 +26,7 @@ config MICROBLAZE select GENERIC_SCHED_CLOCK select HAVE_ARCH_HASH select HAVE_ARCH_KGDB + select HAVE_ARCH_SECCOMP select HAVE_DEBUG_KMEMLEAK select HAVE_DMA_CONTIGUOUS select HAVE_DYNAMIC_FTRACE @@ -120,23 +121,6 @@ config CMDLINE_FORCE Set this to have arguments from the default kernel command string override those passed by the boot loader. -config SECCOMP - bool "Enable seccomp to safely compute untrusted bytecode" - depends on PROC_FS - default y - help - This kernel feature is useful for number crunching applications - that may need to compute untrusted bytecode during their - execution. By using pipes or other transports made available to - the process as file descriptors supporting the read/write - syscalls, it's possible to isolate those applications in - their own address space using seccomp. Once seccomp is - enabled via /proc/<pid>/seccomp, it cannot be disabled - and the task is only allowed to execute a few safe syscalls - defined by each seccomp mode. - - If unsure, say Y. Only embedded should say N here. - endmenu menu "Kernel features" diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig index c95fa3a2484c..5f88a8fc11fc 100644 --- a/arch/mips/Kconfig +++ b/arch/mips/Kconfig @@ -3004,23 +3004,6 @@ config PHYSICAL_START specified in the "crashkernel=YM@XM" command line boot parameter passed to the panic-ed kernel). -config SECCOMP - bool "Enable seccomp to safely compute untrusted bytecode" - depends on PROC_FS - default y - help - This kernel feature is useful for number crunching applications - that may need to compute untrusted bytecode during their - execution. By using pipes or other transports made available to - the process as file descriptors supporting the read/write - syscalls, it's possible to isolate those applications in - their own address space using seccomp. 
Once seccomp is - enabled via /proc/<pid>/seccomp, it cannot be disabled - and the task is only allowed to execute a few safe syscalls - defined by each seccomp mode. - - If unsure, say Y. Only embedded should say N here. - config MIPS_O32_FP64_SUPPORT bool "Support for O32 binaries using 64-bit FP" if !CPU_MIPSR6 depends on 32BIT || MIPS32_O32 diff --git a/arch/parisc/Kconfig b/arch/parisc/Kconfig index 3b0f53dd70bc..cd4afe1e7a6c 100644 --- a/arch/parisc/Kconfig +++ b/arch/parisc/Kconfig @@ -378,19 +378,3 @@ endmenu source "drivers/parisc/Kconfig" - -config SECCOMP - def_bool y - prompt "Enable seccomp to safely compute untrusted bytecode" - help - This kernel feature is useful for number crunching applications - that may need to compute untrusted bytecode during their - execution. By using pipes or other transports made available to - the process as file descriptors supporting the read/write - syscalls, it's possible to isolate those applications in - their own address space using seccomp. Once seccomp is - enabled via prctl(PR_SET_SECCOMP), it cannot be disabled - and the task is only allowed to execute a few safe syscalls - defined by each seccomp mode. - - If unsure, say Y. Only embedded should say N here. diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 1f48bbfb3ce9..136fe860caef 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -934,23 +934,6 @@ config ARCH_WANTS_FREEZER_CONTROL source "kernel/power/Kconfig" -config SECCOMP - bool "Enable seccomp to safely compute untrusted bytecode" - depends on PROC_FS - default y - help - This kernel feature is useful for number crunching applications - that may need to compute untrusted bytecode during their - execution. By using pipes or other transports made available to - the process as file descriptors supporting the read/write - syscalls, it's possible to isolate those applications in - their own address space using seccomp. 
Once seccomp is - enabled via /proc/<pid>/seccomp, it cannot be disabled - and the task is only allowed to execute a few safe syscalls - defined by each seccomp mode. - - If unsure, say Y. Only embedded should say N here. - config PPC_MEM_KEYS prompt "PowerPC Memory Protection Keys" def_bool y diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig index df18372861d8..c456b558fab9 100644 --- a/arch/riscv/Kconfig +++ b/arch/riscv/Kconfig @@ -333,19 +333,6 @@ menu "Kernel features" source "kernel/Kconfig.hz" -config SECCOMP - bool "Enable seccomp to safely compute untrusted bytecode" - help - This kernel feature is useful for number crunching applications - that may need to compute untrusted bytecode during their - execution. By using pipes or other transports made available to - the process as file descriptors supporting the read/write - syscalls, it's possible to isolate those applications in - their own address space using seccomp. Once seccomp is - enabled via prctl(PR_SET_SECCOMP), it cannot be disabled - and the task is only allowed to execute a few safe syscalls - defined by each seccomp mode. - config RISCV_SBI_V01 bool "SBI v0.1 support" default y diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig index 3d86e12e8e3c..7f7b40ec699e 100644 --- a/arch/s390/Kconfig +++ b/arch/s390/Kconfig @@ -791,23 +791,6 @@ config CRASH_DUMP endmenu -config SECCOMP - def_bool y - prompt "Enable seccomp to safely compute untrusted bytecode" - depends on PROC_FS - help - This kernel feature is useful for number crunching applications - that may need to compute untrusted bytecode during their - execution. By using pipes or other transports made available to - the process as file descriptors supporting the read/write - syscalls, it's possible to isolate those applications in - their own address space using seccomp. Once seccomp is - enabled via /proc/<pid>/seccomp, it cannot be disabled - and the task is only allowed to execute a few safe syscalls - defined by each seccomp mode. 
- - If unsure, say Y. - config CCW def_bool y diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig index d20927128fce..18278152c91c 100644 --- a/arch/sh/Kconfig +++ b/arch/sh/Kconfig @@ -600,22 +600,6 @@ config PHYSICAL_START where the fail safe kernel needs to run at a different address than the panic-ed kernel. -config SECCOMP - bool "Enable seccomp to safely compute untrusted bytecode" - depends on PROC_FS - help - This kernel feature is useful for number crunching applications - that may need to compute untrusted bytecode during their - execution. By using pipes or other transports made available to - the process as file descriptors supporting the read/write - syscalls, it's possible to isolate those applications in - their own address space using seccomp. Once seccomp is - enabled via prctl, it cannot be disabled and the task is only - allowed to execute a few safe syscalls defined by each seccomp - mode. - - If unsure, say N. - config SMP bool "Symmetric multi-processing support" depends on SYS_SUPPORTS_SMP diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig index efeff2c896a5..d62ce83cf009 100644 --- a/arch/sparc/Kconfig +++ b/arch/sparc/Kconfig @@ -23,6 +23,7 @@ config SPARC select HAVE_OPROFILE select HAVE_ARCH_KGDB if !SMP || SPARC64 select HAVE_ARCH_TRACEHOOK + select HAVE_ARCH_SECCOMP if SPARC64 select HAVE_EXIT_THREAD select HAVE_PCI select SYSCTL_EXCEPTION_TRACE @@ -226,23 +227,6 @@ config EARLYFB help Say Y here to enable a faster early framebuffer boot console. -config SECCOMP - bool "Enable seccomp to safely compute untrusted bytecode" - depends on SPARC64 && PROC_FS - default y - help - This kernel feature is useful for number crunching applications - that may need to compute untrusted bytecode during their - execution. By using pipes or other transports made available to - the process as file descriptors supporting the read/write - syscalls, it's possible to isolate those applications in - their own address space using seccomp. 
Once seccomp is - enabled via /proc/<pid>/seccomp, it cannot be disabled - and the task is only allowed to execute a few safe syscalls - defined by each seccomp mode. - - If unsure, say Y. Only embedded should say N here. - config HOTPLUG_CPU bool "Support for hot-pluggable CPUs" depends on SPARC64 && SMP diff --git a/arch/um/Kconfig b/arch/um/Kconfig index eb51fec75948..d49f471b02e3 100644 --- a/arch/um/Kconfig +++ b/arch/um/Kconfig @@ -173,22 +173,6 @@ config PGTABLE_LEVELS default 3 if 3_LEVEL_PGTABLES default 2 -config SECCOMP - def_bool y - prompt "Enable seccomp to safely compute untrusted bytecode" - help - This kernel feature is useful for number crunching applications - that may need to compute untrusted bytecode during their - execution. By using pipes or other transports made available to - the process as file descriptors supporting the read/write - syscalls, it's possible to isolate those applications in - their own address space using seccomp. Once seccomp is - enabled via prctl(PR_SET_SECCOMP), it cannot be disabled - and the task is only allowed to execute a few safe syscalls - defined by each seccomp mode. - - If unsure, say Y. - config UML_TIME_TRAVEL_SUPPORT bool prompt "Support time-travel mode (e.g. for test execution)" diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 7101ac64bb20..1ab22869a765 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -1968,22 +1968,6 @@ config EFI_MIXED If unsure, say N. -config SECCOMP - def_bool y - prompt "Enable seccomp to safely compute untrusted bytecode" - help - This kernel feature is useful for number crunching applications - that may need to compute untrusted bytecode during their - execution. By using pipes or other transports made available to - the process as file descriptors supporting the read/write - syscalls, it's possible to isolate those applications in - their own address space using seccomp. 
Once seccomp is - enabled via prctl(PR_SET_SECCOMP), it cannot be disabled - and the task is only allowed to execute a few safe syscalls - defined by each seccomp mode. - - If unsure, say Y. Only embedded should say N here. - source "kernel/Kconfig.hz" config KEXEC diff --git a/arch/xtensa/Kconfig b/arch/xtensa/Kconfig index e997e0119c02..d8a29dc5a284 100644 --- a/arch/xtensa/Kconfig +++ b/arch/xtensa/Kconfig @@ -217,20 +217,6 @@ config HOTPLUG_CPU Say N if you want to disable CPU hotplug. -config SECCOMP - bool - prompt "Enable seccomp to safely compute untrusted bytecode" - help - This kernel feature is useful for number crunching applications - that may need to compute untrusted bytecode during their - execution. By using pipes or other transports made available to - the process as file descriptors supporting the read/write - syscalls, it's possible to isolate those applications in - their own address space using seccomp. Once seccomp is - enabled via prctl(PR_SET_SECCOMP), it cannot be disabled - and the task is only allowed to execute a few safe syscalls - defined by each seccomp mode. - config FAST_SYSCALL_XTENSA bool "Enable fast atomic syscalls" default n -- 2.28.0 ^ permalink raw reply related [flat|nested] 149+ messages in thread
* Re: [PATCH v2 seccomp 1/6] seccomp: Move config option SECCOMP to arch/Kconfig 2020-09-24 12:44 ` [PATCH v2 seccomp 1/6] seccomp: Move config option SECCOMP to arch/Kconfig YiFei Zhu @ 2020-09-24 19:11 ` Kees Cook 2020-10-27 9:52 ` Geert Uytterhoeven 1 sibling, 0 replies; 149+ messages in thread From: Kees Cook @ 2020-09-24 19:11 UTC (permalink / raw) To: containers, YiFei Zhu Cc: Kees Cook, Tycho Andersen, Valentin Rothberg, Aleksa Sarai, Giuseppe Scrivano, Jann Horn, Tobin Feldman-Fitzthum, Josep Torrellas, Tianyin Xu, Hubertus Franke, linux-kernel, bpf, YiFei Zhu, Dimitrios Skarlatos, Jack Chen, Andrea Arcangeli, Andy Lutomirski, Will Drewry On Thu, 24 Sep 2020 07:44:15 -0500, YiFei Zhu wrote: > In order to make adding configurable features into seccomp > easier, it's better to have the options at one single location, > considering easpecially that the bulk of seccomp code is > arch-independent. An quick look also show that many SECCOMP > descriptions are outdated; they talk about /proc rather than > prctl. > > As a result of moving the config option and keeping it default > on, architectures arm, arm64, csky, riscv, sh, and xtensa > did not have SECCOMP on by default prior to this and SECCOMP will > be default in this change. > > Architectures microblaze, mips, powerpc, s390, sh, and sparc > have an outdated depend on PROC_FS and this dependency is removed > in this change. > > Suggested-by: Jann Horn <jannh@google.com> > Link: https://lore.kernel.org/lkml/CAG48ez1YWz9cnp08UZgeieYRhHdqh-ch7aNwc4JRBnGyrmgfMg@mail.gmail.com/ > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> > [...] Yes; I've been meaning to do this for a while now. Thank you! I tweaked the help text a bit. Applied, thanks! [1/1] seccomp: Move config option SECCOMP to arch/Kconfig https://git.kernel.org/kees/c/c3c9c2df3636 -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v2 seccomp 1/6] seccomp: Move config option SECCOMP to arch/Kconfig 2020-09-24 12:44 ` [PATCH v2 seccomp 1/6] seccomp: Move config option SECCOMP to arch/Kconfig YiFei Zhu 2020-09-24 19:11 ` Kees Cook @ 2020-10-27 9:52 ` Geert Uytterhoeven 2020-10-27 19:08 ` YiFei Zhu 2020-10-28 0:06 ` Kees Cook 1 sibling, 2 replies; 149+ messages in thread From: Geert Uytterhoeven @ 2020-10-27 9:52 UTC (permalink / raw) To: YiFei Zhu Cc: containers, YiFei Zhu, bpf, Linux Kernel Mailing List, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry Hi Yifei, On Thu, Sep 24, 2020 at 2:48 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > From: YiFei Zhu <yifeifz2@illinois.edu> > > In order to make adding configurable features into seccomp > easier, it's better to have the options at one single location, > considering easpecially that the bulk of seccomp code is > arch-independent. An quick look also show that many SECCOMP > descriptions are outdated; they talk about /proc rather than > prctl. > > As a result of moving the config option and keeping it default > on, architectures arm, arm64, csky, riscv, sh, and xtensa > did not have SECCOMP on by default prior to this and SECCOMP will > be default in this change. > > Architectures microblaze, mips, powerpc, s390, sh, and sparc > have an outdated depend on PROC_FS and this dependency is removed > in this change. > > Suggested-by: Jann Horn <jannh@google.com> > Link: https://lore.kernel.org/lkml/CAG48ez1YWz9cnp08UZgeieYRhHdqh-ch7aNwc4JRBnGyrmgfMg@mail.gmail.com/ > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> Thanks for your patch. which is now commit 282a181b1a0d66de ("seccomp: Move config option SECCOMP to arch/Kconfig") in v5.10-rc1. 
> --- a/arch/Kconfig > +++ b/arch/Kconfig > @@ -458,6 +462,23 @@ config HAVE_ARCH_SECCOMP_FILTER > results in the system call being skipped immediately. > - seccomp syscall wired up > > +config SECCOMP > + def_bool y > + depends on HAVE_ARCH_SECCOMP > + prompt "Enable seccomp to safely compute untrusted bytecode" > + help > + This kernel feature is useful for number crunching applications > + that may need to compute untrusted bytecode during their > + execution. By using pipes or other transports made available to > + the process as file descriptors supporting the read/write > + syscalls, it's possible to isolate those applications in > + their own address space using seccomp. Once seccomp is > + enabled via prctl(PR_SET_SECCOMP), it cannot be disabled > + and the task is only allowed to execute a few safe syscalls > + defined by each seccomp mode. > + > + If unsure, say Y. Only embedded should say N here. > + Please tell me why SECCOMP is special, and deserves to default to be enabled. Is it really that critical, given only 13.5 (half of sparc ;-) out of 24 architectures implement support for it? Thanks! Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v2 seccomp 1/6] seccomp: Move config option SECCOMP to arch/Kconfig 2020-10-27 9:52 ` Geert Uytterhoeven @ 2020-10-27 19:08 ` YiFei Zhu 2020-10-28 0:06 ` Kees Cook 1 sibling, 0 replies; 149+ messages in thread From: YiFei Zhu @ 2020-10-27 19:08 UTC (permalink / raw) To: Geert Uytterhoeven Cc: Linux Containers, YiFei Zhu, bpf, Linux Kernel Mailing List, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Tue, Oct 27, 2020 at 4:52 AM Geert Uytterhoeven <geert@linux-m68k.org> wrote: > Please tell me why SECCOMP is special, and deserves to default to be > enabled. Is it really that critical, given only 13.5 (half of sparc > ;-) out of 24 > architectures implement support for it? Good point. My thought process is that quite a lot of system software relies on seccomp for enforcing policies -- systemd, docker, and other sandboxing tools like browsers and firejail, so when I moved this to the non-per-arch section, it at least has to be default for x86. Granted, I'm not super familiar with other architectures, so you are probably right that those that did not have it on by default should be kept off by default; many of them could be for embedded devices. What's the best way to do this? Set it as default N in Kconfig and add CONFIG_SECCOMP=y in each arch's defconfig? YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
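The defconfig approach YiFei asks about would look roughly like this (a hypothetical sketch, not code from any posted patch):

```kconfig
# arch/Kconfig: no global default
config SECCOMP
	bool "Enable seccomp to safely compute untrusted bytecode"
	depends on HAVE_ARCH_SECCOMP

# arch/<arch>/configs/defconfig: each architecture opts in explicitly
CONFIG_SECCOMP=y
```

The cost is touching every architecture's defconfig, and existing configs refreshed with `make olddefconfig` would silently drop SECCOMP on architectures that had carried their own `default y`.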
* Re: [PATCH v2 seccomp 1/6] seccomp: Move config option SECCOMP to arch/Kconfig 2020-10-27 9:52 ` Geert Uytterhoeven 2020-10-27 19:08 ` YiFei Zhu @ 2020-10-28 0:06 ` Kees Cook 2020-10-28 8:18 ` Geert Uytterhoeven 1 sibling, 1 reply; 149+ messages in thread From: Kees Cook @ 2020-10-28 0:06 UTC (permalink / raw) To: Geert Uytterhoeven Cc: YiFei Zhu, containers, YiFei Zhu, bpf, Linux Kernel Mailing List, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Tue, Oct 27, 2020 at 10:52:39AM +0100, Geert Uytterhoeven wrote: > Hi Yifei, > > On Thu, Sep 24, 2020 at 2:48 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > > From: YiFei Zhu <yifeifz2@illinois.edu> > > > > In order to make adding configurable features into seccomp > > easier, it's better to have the options at one single location, > > considering easpecially that the bulk of seccomp code is > > arch-independent. An quick look also show that many SECCOMP > > descriptions are outdated; they talk about /proc rather than > > prctl. > > > > As a result of moving the config option and keeping it default > > on, architectures arm, arm64, csky, riscv, sh, and xtensa > > did not have SECCOMP on by default prior to this and SECCOMP will > > be default in this change. > > > > Architectures microblaze, mips, powerpc, s390, sh, and sparc > > have an outdated depend on PROC_FS and this dependency is removed > > in this change. > > > > Suggested-by: Jann Horn <jannh@google.com> > > Link: https://lore.kernel.org/lkml/CAG48ez1YWz9cnp08UZgeieYRhHdqh-ch7aNwc4JRBnGyrmgfMg@mail.gmail.com/ > > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> > > Thanks for your patch. which is now commit 282a181b1a0d66de ("seccomp: > Move config option SECCOMP to arch/Kconfig") in v5.10-rc1. 
> > > --- a/arch/Kconfig > > +++ b/arch/Kconfig > > @@ -458,6 +462,23 @@ config HAVE_ARCH_SECCOMP_FILTER > > results in the system call being skipped immediately. > > - seccomp syscall wired up > > > > +config SECCOMP > > + def_bool y > > + depends on HAVE_ARCH_SECCOMP > > + prompt "Enable seccomp to safely compute untrusted bytecode" > > + help > > + This kernel feature is useful for number crunching applications > > + that may need to compute untrusted bytecode during their > > + execution. By using pipes or other transports made available to > > + the process as file descriptors supporting the read/write > > + syscalls, it's possible to isolate those applications in > > + their own address space using seccomp. Once seccomp is > > + enabled via prctl(PR_SET_SECCOMP), it cannot be disabled > > + and the task is only allowed to execute a few safe syscalls > > + defined by each seccomp mode. > > + > > + If unsure, say Y. Only embedded should say N here. > > + > > Please tell me why SECCOMP is special, and deserves to default to be > enabled. Is it really that critical, given only 13.5 (half of sparc > ;-) out of 24 > architectures implement support for it? That's an excellent point; I missed this in my review as I saw several Kconfig already marked "def_bool y" but failed to note it wasn't _all_ of them. Okay, checking before this patch, these had them effectively enabled: via Kconfig: parisc s390 um x86 via defconfig, roughly speaking: arm arm64 sh How about making the default depend on HAVE_ARCH_SECCOMP_FILTER? 
These have SECCOMP_FILTER support: arch/arm/Kconfig: select HAVE_ARCH_SECCOMP_FILTER if AEABI && !OABI_COMPAT arch/arm64/Kconfig: select HAVE_ARCH_SECCOMP_FILTER arch/csky/Kconfig: select HAVE_ARCH_SECCOMP_FILTER arch/mips/Kconfig: select HAVE_ARCH_SECCOMP_FILTER arch/parisc/Kconfig: select HAVE_ARCH_SECCOMP_FILTER arch/powerpc/Kconfig: select HAVE_ARCH_SECCOMP_FILTER arch/riscv/Kconfig: select HAVE_ARCH_SECCOMP_FILTER arch/s390/Kconfig: select HAVE_ARCH_SECCOMP_FILTER arch/sh/Kconfig: select HAVE_ARCH_SECCOMP_FILTER arch/um/Kconfig: select HAVE_ARCH_SECCOMP_FILTER arch/x86/Kconfig: select HAVE_ARCH_SECCOMP_FILTER arch/xtensa/Kconfig: select HAVE_ARCH_SECCOMP_FILTER So the "new" promotions would be: csky mips powerpc riscv xtensa Which would leave only these two: arch/microblaze/Kconfig: select HAVE_ARCH_SECCOMP arch/sparc/Kconfig: select HAVE_ARCH_SECCOMP if SPARC64 At this point, given the ubiquity of seccomp usage (e.g. systemd), I guess it's not unreasonable to make it def_bool y? I'm open to suggestions! -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v2 seccomp 1/6] seccomp: Move config option SECCOMP to arch/Kconfig 2020-10-28 0:06 ` Kees Cook @ 2020-10-28 8:18 ` Geert Uytterhoeven 2020-10-28 9:34 ` Jann Horn 0 siblings, 1 reply; 149+ messages in thread From: Geert Uytterhoeven @ 2020-10-28 8:18 UTC (permalink / raw) To: Kees Cook Cc: YiFei Zhu, containers, YiFei Zhu, bpf, Linux Kernel Mailing List, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry Hi Kees, On Wed, Oct 28, 2020 at 1:06 AM Kees Cook <keescook@chromium.org> wrote: > On Tue, Oct 27, 2020 at 10:52:39AM +0100, Geert Uytterhoeven wrote: > > On Thu, Sep 24, 2020 at 2:48 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > > > From: YiFei Zhu <yifeifz2@illinois.edu> > > > > > > In order to make adding configurable features into seccomp > > > easier, it's better to have the options at one single location, > > > considering easpecially that the bulk of seccomp code is > > > arch-independent. An quick look also show that many SECCOMP > > > descriptions are outdated; they talk about /proc rather than > > > prctl. > > > > > > As a result of moving the config option and keeping it default > > > on, architectures arm, arm64, csky, riscv, sh, and xtensa > > > did not have SECCOMP on by default prior to this and SECCOMP will > > > be default in this change. > > > > > > Architectures microblaze, mips, powerpc, s390, sh, and sparc > > > have an outdated depend on PROC_FS and this dependency is removed > > > in this change. > > > > > > Suggested-by: Jann Horn <jannh@google.com> > > > Link: https://lore.kernel.org/lkml/CAG48ez1YWz9cnp08UZgeieYRhHdqh-ch7aNwc4JRBnGyrmgfMg@mail.gmail.com/ > > > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> > > > > Thanks for your patch. which is now commit 282a181b1a0d66de ("seccomp: > > Move config option SECCOMP to arch/Kconfig") in v5.10-rc1. 
> > > > > --- a/arch/Kconfig > > > +++ b/arch/Kconfig > > > @@ -458,6 +462,23 @@ config HAVE_ARCH_SECCOMP_FILTER > > > results in the system call being skipped immediately. > > > - seccomp syscall wired up > > > > > > +config SECCOMP > > > + def_bool y > > > + depends on HAVE_ARCH_SECCOMP > > > + prompt "Enable seccomp to safely compute untrusted bytecode" > > > + help > > > + This kernel feature is useful for number crunching applications > > > + that may need to compute untrusted bytecode during their > > > + execution. By using pipes or other transports made available to > > > + the process as file descriptors supporting the read/write > > > + syscalls, it's possible to isolate those applications in > > > + their own address space using seccomp. Once seccomp is > > > + enabled via prctl(PR_SET_SECCOMP), it cannot be disabled > > > + and the task is only allowed to execute a few safe syscalls > > > + defined by each seccomp mode. > > > + > > > + If unsure, say Y. Only embedded should say N here. > > > + > > > > Please tell me why SECCOMP is special, and deserves to default to be > > enabled. Is it really that critical, given only 13.5 (half of sparc > > ;-) out of 24 > > architectures implement support for it? > > That's an excellent point; I missed this in my review as I saw several > Kconfig already marked "def_bool y" but failed to note it wasn't _all_ > of them. Okay, checking before this patch, these had them effectively > enabled: > > via Kconfig: > > parisc > s390 > um > x86 Mostly "server" and "desktop" platforms. > via defconfig, roughly speaking: > > arm > arm64 > sh Note that these defconfigs are example configs, not meant for production. E.g. arm/multi_v7_defconfig and arm64/defconfig enable about everything for compile coverage. > How about making the default depend on HAVE_ARCH_SECCOMP_FILTER? 
> > These have SECCOMP_FILTER support: > > arch/arm/Kconfig: select HAVE_ARCH_SECCOMP_FILTER if AEABI && !OABI_COMPAT > arch/arm64/Kconfig: select HAVE_ARCH_SECCOMP_FILTER > arch/csky/Kconfig: select HAVE_ARCH_SECCOMP_FILTER > arch/mips/Kconfig: select HAVE_ARCH_SECCOMP_FILTER > arch/parisc/Kconfig: select HAVE_ARCH_SECCOMP_FILTER > arch/powerpc/Kconfig: select HAVE_ARCH_SECCOMP_FILTER > arch/riscv/Kconfig: select HAVE_ARCH_SECCOMP_FILTER > arch/s390/Kconfig: select HAVE_ARCH_SECCOMP_FILTER > arch/sh/Kconfig: select HAVE_ARCH_SECCOMP_FILTER > arch/um/Kconfig: select HAVE_ARCH_SECCOMP_FILTER > arch/x86/Kconfig: select HAVE_ARCH_SECCOMP_FILTER > arch/xtensa/Kconfig: select HAVE_ARCH_SECCOMP_FILTER > > So the "new" promotions would be: > > csky > mips > powerpc > riscv > xtensa > > Which would leave only these two: > > arch/microblaze/Kconfig: select HAVE_ARCH_SECCOMP > arch/sparc/Kconfig: select HAVE_ARCH_SECCOMP if SPARC64 > > At this point, given the ubiquity of seccomp usage (e.g. systemd), I > guess it's not unreasonable to make it def_bool y? Having support does not necessarily imply you want it enabled. If systemd needs it (does it? I have Debian nfsroots with systemd, without SECCOMP), you can enable it in the defconfig. "Default y" is for things you cannot do without, unless you know better. Bloat-o-meter says enabling SECCOMP consumes only ca. 8 KiB (on arm32), so perhaps "default y if !EXPERT"? Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds ^ permalink raw reply [flat|nested] 149+ messages in thread
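Geert's suggestion, spelled out as a Kconfig fragment (hypothetical sketch; the final in-tree entry may differ):

```kconfig
config SECCOMP
	bool "Enable seccomp to safely compute untrusted bytecode"
	depends on HAVE_ARCH_SECCOMP
	default y if !EXPERT
```

With this, non-EXPERT configurations keep SECCOMP on by default, while EXPERT builds default to N but can still enable it through the prompt.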
* Re: [PATCH v2 seccomp 1/6] seccomp: Move config option SECCOMP to arch/Kconfig 2020-10-28 8:18 ` Geert Uytterhoeven @ 2020-10-28 9:34 ` Jann Horn 0 siblings, 0 replies; 149+ messages in thread From: Jann Horn @ 2020-10-28 9:34 UTC (permalink / raw) To: Geert Uytterhoeven Cc: Kees Cook, YiFei Zhu, Linux Containers, YiFei Zhu, bpf, Linux Kernel Mailing List, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Wed, Oct 28, 2020 at 9:19 AM Geert Uytterhoeven <geert@linux-m68k.org> wrote: > On Wed, Oct 28, 2020 at 1:06 AM Kees Cook <keescook@chromium.org> wrote: > > On Tue, Oct 27, 2020 at 10:52:39AM +0100, Geert Uytterhoeven wrote: > > > On Thu, Sep 24, 2020 at 2:48 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > > > > From: YiFei Zhu <yifeifz2@illinois.edu> > > > > > > > > In order to make adding configurable features into seccomp > > > > easier, it's better to have the options at one single location, > > > > considering easpecially that the bulk of seccomp code is > > > > arch-independent. An quick look also show that many SECCOMP > > > > descriptions are outdated; they talk about /proc rather than > > > > prctl. > > > > > > > > As a result of moving the config option and keeping it default > > > > on, architectures arm, arm64, csky, riscv, sh, and xtensa > > > > did not have SECCOMP on by default prior to this and SECCOMP will > > > > be default in this change. > > > > > > > > Architectures microblaze, mips, powerpc, s390, sh, and sparc > > > > have an outdated depend on PROC_FS and this dependency is removed > > > > in this change. > > > > > > > > Suggested-by: Jann Horn <jannh@google.com> > > > > Link: https://lore.kernel.org/lkml/CAG48ez1YWz9cnp08UZgeieYRhHdqh-ch7aNwc4JRBnGyrmgfMg@mail.gmail.com/ > > > > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> > > > > > > Thanks for your patch. 
which is now commit 282a181b1a0d66de ("seccomp: > > > Move config option SECCOMP to arch/Kconfig") in v5.10-rc1. > > > > > > > --- a/arch/Kconfig > > > > +++ b/arch/Kconfig > > > > @@ -458,6 +462,23 @@ config HAVE_ARCH_SECCOMP_FILTER > > > > results in the system call being skipped immediately. > > > > - seccomp syscall wired up > > > > > > > > +config SECCOMP > > > > + def_bool y > > > > + depends on HAVE_ARCH_SECCOMP > > > > + prompt "Enable seccomp to safely compute untrusted bytecode" > > > > + help > > > > + This kernel feature is useful for number crunching applications > > > > + that may need to compute untrusted bytecode during their > > > > + execution. By using pipes or other transports made available to > > > > + the process as file descriptors supporting the read/write > > > > + syscalls, it's possible to isolate those applications in > > > > + their own address space using seccomp. Once seccomp is > > > > + enabled via prctl(PR_SET_SECCOMP), it cannot be disabled > > > > + and the task is only allowed to execute a few safe syscalls > > > > + defined by each seccomp mode. > > > > + > > > > + If unsure, say Y. Only embedded should say N here. > > > > + > > > > > > Please tell me why SECCOMP is special, and deserves to default to be > > > enabled. Is it really that critical, given only 13.5 (half of sparc > > > ;-) out of 24 > > > architectures implement support for it? > > > > That's an excellent point; I missed this in my review as I saw several > > Kconfig already marked "def_bool y" but failed to note it wasn't _all_ > > of them. Okay, checking before this patch, these had them effectively > > enabled: > > > > via Kconfig: > > > > parisc > > s390 > > um > > x86 > > Mostly "server" and "desktop" platforms. > > > via defconfig, roughly speaking: > > > > arm > > arm64 > > sh > > Note that these defconfigs are example configs, not meant for production. > E.g. arm/multi_v7_defconfig and arm64/defconfig enable about everything > for compile coverage. 
> > > How about making the default depend on HAVE_ARCH_SECCOMP_FILTER? > > > > These have SECCOMP_FILTER support: > > > > arch/arm/Kconfig: select HAVE_ARCH_SECCOMP_FILTER if AEABI && !OABI_COMPAT > > arch/arm64/Kconfig: select HAVE_ARCH_SECCOMP_FILTER > > arch/csky/Kconfig: select HAVE_ARCH_SECCOMP_FILTER > > arch/mips/Kconfig: select HAVE_ARCH_SECCOMP_FILTER > > arch/parisc/Kconfig: select HAVE_ARCH_SECCOMP_FILTER > > arch/powerpc/Kconfig: select HAVE_ARCH_SECCOMP_FILTER > > arch/riscv/Kconfig: select HAVE_ARCH_SECCOMP_FILTER > > arch/s390/Kconfig: select HAVE_ARCH_SECCOMP_FILTER > > arch/sh/Kconfig: select HAVE_ARCH_SECCOMP_FILTER > > arch/um/Kconfig: select HAVE_ARCH_SECCOMP_FILTER > > arch/x86/Kconfig: select HAVE_ARCH_SECCOMP_FILTER > > arch/xtensa/Kconfig: select HAVE_ARCH_SECCOMP_FILTER > > > > So the "new" promotions would be: > > > > csky > > mips > > powerpc > > riscv > > xtensa > > > > Which would leave only these two: > > > > arch/microblaze/Kconfig: select HAVE_ARCH_SECCOMP > > arch/sparc/Kconfig: select HAVE_ARCH_SECCOMP if SPARC64 > > > > At this point, given the ubiquity of seccomp usage (e.g. systemd), I > > guess it's not unreasonable to make it def_bool y? > > Having support does not necessarily imply you want it enabled. > If systemd needs it (does it? I have Debian nfsroots with systemd, > without SECCOMP), you can enable it in the defconfig. > "Default y" is for things you cannot do without, unless you know > better. > > Bloat-o-meter says enabling SECCOMP consumes only ca. 8 KiB > (on arm32), so perhaps "default y if !EXPERT"? Gating a *default* on EXPERT seems weird to me. Isn't EXPERT normally used to gate whether things are configurable at all (using "if EXPERT")? I think that at least on systems with MMU, SECCOMP should default to y, independent of what EXPERT is set to. When SECCOMP is disabled, various pieces of software will have to (potentially invisibly to the user) degrade their belts-and-suspenders security measures. 
For example, as far as I understand, systemd has support for using seccomp to restrict what services can do (and uses that for some of its built-in services), but skips those steps with a log message if you don't have SECCOMP. Perhaps more importantly, the same thing happens in OpenSSH's ssh_sandbox_child() function - it generates a debug message, then continues on. If someone does manage to find an OpenSSH pre-auth remote code execution bug in a few years, I think we very much wouldn't want to be in a situation where that can be used to compromise a bunch of routers just because SECCOMP wasn't in the default config, or because it was invisibly disabled when the router vendor enabled EXPERT so that they can get rid of io_uring support. ^ permalink raw reply [flat|nested] 149+ messages in thread
* [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array 2020-09-24 12:44 ` [PATCH v2 seccomp 0/6] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls YiFei Zhu 2020-09-24 12:44 ` [PATCH v2 seccomp 1/6] seccomp: Move config option SECCOMP to arch/Kconfig YiFei Zhu @ 2020-09-24 12:44 ` YiFei Zhu 2020-09-24 13:47 ` David Laight 2020-09-24 12:44 ` [PATCH v2 seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent YiFei Zhu ` (3 subsequent siblings) 5 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-09-24 12:44 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: YiFei Zhu <yifeifz2@illinois.edu> The seccomp cache emulator needs to know all the architecture numbers that syscall_get_arch() could return for the kernel build, in order to generate a cache for all of them. The array is declared in the header as static __maybe_unused const to maximize compiler optimization opportunities such as loop unrolling.
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> --- arch/alpha/include/asm/syscall.h | 4 ++++ arch/arc/include/asm/syscall.h | 24 +++++++++++++++++++----- arch/arm/include/asm/syscall.h | 4 ++++ arch/arm64/include/asm/syscall.h | 4 ++++ arch/c6x/include/asm/syscall.h | 13 +++++++++++-- arch/csky/include/asm/syscall.h | 4 ++++ arch/h8300/include/asm/syscall.h | 4 ++++ arch/hexagon/include/asm/syscall.h | 4 ++++ arch/ia64/include/asm/syscall.h | 4 ++++ arch/m68k/include/asm/syscall.h | 4 ++++ arch/microblaze/include/asm/syscall.h | 4 ++++ arch/mips/include/asm/syscall.h | 16 ++++++++++++++++ arch/nds32/include/asm/syscall.h | 13 +++++++++++-- arch/nios2/include/asm/syscall.h | 4 ++++ arch/openrisc/include/asm/syscall.h | 4 ++++ arch/parisc/include/asm/syscall.h | 7 +++++++ arch/powerpc/include/asm/syscall.h | 14 ++++++++++++++ arch/riscv/include/asm/syscall.h | 14 ++++++++++---- arch/s390/include/asm/syscall.h | 7 +++++++ arch/sh/include/asm/syscall_32.h | 17 +++++++++++------ arch/sparc/include/asm/syscall.h | 9 +++++++++ arch/x86/include/asm/syscall.h | 11 +++++++++++ arch/x86/um/asm/syscall.h | 14 ++++++++++---- arch/xtensa/include/asm/syscall.h | 4 ++++ 24 files changed, 184 insertions(+), 23 deletions(-) diff --git a/arch/alpha/include/asm/syscall.h b/arch/alpha/include/asm/syscall.h index 11c688c1d7ec..625ac9b23f37 100644 --- a/arch/alpha/include/asm/syscall.h +++ b/arch/alpha/include/asm/syscall.h @@ -4,6 +4,10 @@ #include <uapi/linux/audit.h> +static __maybe_unused const int syscall_arches[] = { + AUDIT_ARCH_ALPHA +}; + static inline int syscall_get_arch(struct task_struct *task) { return AUDIT_ARCH_ALPHA; diff --git a/arch/arc/include/asm/syscall.h b/arch/arc/include/asm/syscall.h index 94529e89dff0..899c13cbf5cc 100644 --- a/arch/arc/include/asm/syscall.h +++ b/arch/arc/include/asm/syscall.h @@ -65,14 +65,28 @@ syscall_get_arguments(struct task_struct *task, struct pt_regs *regs, } } +#ifdef CONFIG_ISA_ARCOMPACT +# ifdef CONFIG_CPU_BIG_ENDIAN +# define 
SYSCALL_ARCH AUDIT_ARCH_ARCOMPACTBE +# else +# define SYSCALL_ARCH AUDIT_ARCH_ARCOMPACT +# endif /* CONFIG_CPU_BIG_ENDIAN */ +#else +# ifdef CONFIG_CPU_BIG_ENDIAN +# define SYSCALL_ARCH AUDIT_ARCH_ARCV2BE +# else +# define SYSCALL_ARCH AUDIT_ARCH_ARCV2 +# endif /* CONFIG_CPU_BIG_ENDIAN */ +#endif /* CONFIG_ISA_ARCOMPACT */ + +static __maybe_unused const int syscall_arches[] = { + SYSCALL_ARCH +}; + static inline int syscall_get_arch(struct task_struct *task) { - return IS_ENABLED(CONFIG_ISA_ARCOMPACT) - ? (IS_ENABLED(CONFIG_CPU_BIG_ENDIAN) - ? AUDIT_ARCH_ARCOMPACTBE : AUDIT_ARCH_ARCOMPACT) - : (IS_ENABLED(CONFIG_CPU_BIG_ENDIAN) - ? AUDIT_ARCH_ARCV2BE : AUDIT_ARCH_ARCV2); + return SYSCALL_ARCH; } #endif diff --git a/arch/arm/include/asm/syscall.h b/arch/arm/include/asm/syscall.h index fd02761ba06c..33ade26e3956 100644 --- a/arch/arm/include/asm/syscall.h +++ b/arch/arm/include/asm/syscall.h @@ -73,6 +73,10 @@ static inline void syscall_set_arguments(struct task_struct *task, memcpy(®s->ARM_r0 + 1, args, 5 * sizeof(args[0])); } +static __maybe_unused const int syscall_arches[] = { + AUDIT_ARCH_ARM +}; + static inline int syscall_get_arch(struct task_struct *task) { /* ARM tasks don't change audit architectures on the fly. */ diff --git a/arch/arm64/include/asm/syscall.h b/arch/arm64/include/asm/syscall.h index cfc0672013f6..77f3d300e7a0 100644 --- a/arch/arm64/include/asm/syscall.h +++ b/arch/arm64/include/asm/syscall.h @@ -82,6 +82,10 @@ static inline void syscall_set_arguments(struct task_struct *task, memcpy(®s->regs[1], args, 5 * sizeof(args[0])); } +static __maybe_unused const int syscall_arches[] = { + AUDIT_ARCH_ARM, AUDIT_ARCH_AARCH64 +}; + /* * We don't care about endianness (__AUDIT_ARCH_LE bit) here because * AArch64 has the same system calls both on little- and big- endian. 
diff --git a/arch/c6x/include/asm/syscall.h b/arch/c6x/include/asm/syscall.h index 38f3e2284ecd..0d78c67ee1fc 100644 --- a/arch/c6x/include/asm/syscall.h +++ b/arch/c6x/include/asm/syscall.h @@ -66,10 +66,19 @@ static inline void syscall_set_arguments(struct task_struct *task, regs->a9 = *args; } +#ifdef CONFIG_CPU_BIG_ENDIAN +#define SYSCALL_ARCH AUDIT_ARCH_C6XBE +#else +#define SYSCALL_ARCH AUDIT_ARCH_C6X +#endif + +static __maybe_unused const int syscall_arches[] = { + SYSCALL_ARCH +}; + static inline int syscall_get_arch(struct task_struct *task) { - return IS_ENABLED(CONFIG_CPU_BIG_ENDIAN) - ? AUDIT_ARCH_C6XBE : AUDIT_ARCH_C6X; + return SYSCALL_ARCH; } #endif /* __ASM_C6X_SYSCALLS_H */ diff --git a/arch/csky/include/asm/syscall.h b/arch/csky/include/asm/syscall.h index f624fa3bbc22..86242d2850d7 100644 --- a/arch/csky/include/asm/syscall.h +++ b/arch/csky/include/asm/syscall.h @@ -68,6 +68,10 @@ syscall_set_arguments(struct task_struct *task, struct pt_regs *regs, memcpy(®s->a1, args, 5 * sizeof(regs->a1)); } +static __maybe_unused const int syscall_arches[] = { + AUDIT_ARCH_CSKY +}; + static inline int syscall_get_arch(struct task_struct *task) { diff --git a/arch/h8300/include/asm/syscall.h b/arch/h8300/include/asm/syscall.h index 01666b8bb263..775f6ac8fde3 100644 --- a/arch/h8300/include/asm/syscall.h +++ b/arch/h8300/include/asm/syscall.h @@ -28,6 +28,10 @@ syscall_get_arguments(struct task_struct *task, struct pt_regs *regs, *args = regs->er6; } +static __maybe_unused const int syscall_arches[] = { + AUDIT_ARCH_H8300 +}; + static inline int syscall_get_arch(struct task_struct *task) { diff --git a/arch/hexagon/include/asm/syscall.h b/arch/hexagon/include/asm/syscall.h index f6e454f18038..6ee21a76f6a3 100644 --- a/arch/hexagon/include/asm/syscall.h +++ b/arch/hexagon/include/asm/syscall.h @@ -45,6 +45,10 @@ static inline long syscall_get_return_value(struct task_struct *task, return regs->r00; } +static __maybe_unused const int syscall_arches[] = { + 
AUDIT_ARCH_HEXAGON +}; + static inline int syscall_get_arch(struct task_struct *task) { return AUDIT_ARCH_HEXAGON; diff --git a/arch/ia64/include/asm/syscall.h b/arch/ia64/include/asm/syscall.h index 6c6f16e409a8..19456125c89a 100644 --- a/arch/ia64/include/asm/syscall.h +++ b/arch/ia64/include/asm/syscall.h @@ -71,6 +71,10 @@ static inline void syscall_set_arguments(struct task_struct *task, ia64_syscall_get_set_arguments(task, regs, args, 1); } +static __maybe_unused const int syscall_arches[] = { + AUDIT_ARCH_IA64 +}; + static inline int syscall_get_arch(struct task_struct *task) { return AUDIT_ARCH_IA64; diff --git a/arch/m68k/include/asm/syscall.h b/arch/m68k/include/asm/syscall.h index 465ac039be09..031b051f9026 100644 --- a/arch/m68k/include/asm/syscall.h +++ b/arch/m68k/include/asm/syscall.h @@ -4,6 +4,10 @@ #include <uapi/linux/audit.h> +static __maybe_unused const int syscall_arches[] = { + AUDIT_ARCH_M68K +}; + static inline int syscall_get_arch(struct task_struct *task) { return AUDIT_ARCH_M68K; diff --git a/arch/microblaze/include/asm/syscall.h b/arch/microblaze/include/asm/syscall.h index 3a6924f3cbde..28cde14056d1 100644 --- a/arch/microblaze/include/asm/syscall.h +++ b/arch/microblaze/include/asm/syscall.h @@ -105,6 +105,10 @@ static inline void syscall_set_arguments(struct task_struct *task, asmlinkage unsigned long do_syscall_trace_enter(struct pt_regs *regs); asmlinkage void do_syscall_trace_leave(struct pt_regs *regs); +static __maybe_unused const int syscall_arches[] = { + AUDIT_ARCH_MICROBLAZE +}; + static inline int syscall_get_arch(struct task_struct *task) { return AUDIT_ARCH_MICROBLAZE; diff --git a/arch/mips/include/asm/syscall.h b/arch/mips/include/asm/syscall.h index 25fa651c937d..29e4c1c47c54 100644 --- a/arch/mips/include/asm/syscall.h +++ b/arch/mips/include/asm/syscall.h @@ -140,6 +140,22 @@ extern const unsigned long sys_call_table[]; extern const unsigned long sys32_call_table[]; extern const unsigned long sysn32_call_table[]; 
+static __maybe_unused const int syscall_arches[] = { +#ifdef __LITTLE_ENDIAN + AUDIT_ARCH_MIPSEL, +# ifdef CONFIG_64BIT + AUDIT_ARCH_MIPSEL64, + AUDIT_ARCH_MIPSEL64N32, +# endif /* CONFIG_64BIT */ +#else + AUDIT_ARCH_MIPS, +# ifdef CONFIG_64BIT + AUDIT_ARCH_MIPS64, + AUDIT_ARCH_MIPS64N32, +# endif /* CONFIG_64BIT */ +#endif /* __LITTLE_ENDIAN */ +}; + static inline int syscall_get_arch(struct task_struct *task) { int arch = AUDIT_ARCH_MIPS; diff --git a/arch/nds32/include/asm/syscall.h b/arch/nds32/include/asm/syscall.h index 7b5180d78e20..2dd5e33bcfcb 100644 --- a/arch/nds32/include/asm/syscall.h +++ b/arch/nds32/include/asm/syscall.h @@ -154,11 +154,20 @@ syscall_set_arguments(struct task_struct *task, struct pt_regs *regs, memcpy(®s->uregs[0] + 1, args, 5 * sizeof(args[0])); } +#ifdef CONFIG_CPU_BIG_ENDIAN +#define SYSCALL_ARCH AUDIT_ARCH_NDS32BE +#else +#define SYSCALL_ARCH AUDIT_ARCH_NDS32 +#endif + +static __maybe_unused const int syscall_arches[] = { + SYSCALL_ARCH +}; + static inline int syscall_get_arch(struct task_struct *task) { - return IS_ENABLED(CONFIG_CPU_BIG_ENDIAN) - ? 
AUDIT_ARCH_NDS32BE : AUDIT_ARCH_NDS32; + return SYSCALL_ARCH; } #endif /* _ASM_NDS32_SYSCALL_H */ diff --git a/arch/nios2/include/asm/syscall.h b/arch/nios2/include/asm/syscall.h index 526449edd768..8fa2716cac5a 100644 --- a/arch/nios2/include/asm/syscall.h +++ b/arch/nios2/include/asm/syscall.h @@ -69,6 +69,10 @@ static inline void syscall_set_arguments(struct task_struct *task, regs->r9 = *args; } +static __maybe_unused const int syscall_arches[] = { + AUDIT_ARCH_NIOS2 +}; + static inline int syscall_get_arch(struct task_struct *task) { return AUDIT_ARCH_NIOS2; diff --git a/arch/openrisc/include/asm/syscall.h b/arch/openrisc/include/asm/syscall.h index e6383be2a195..4eb28ad08042 100644 --- a/arch/openrisc/include/asm/syscall.h +++ b/arch/openrisc/include/asm/syscall.h @@ -64,6 +64,10 @@ syscall_set_arguments(struct task_struct *task, struct pt_regs *regs, memcpy(®s->gpr[3], args, 6 * sizeof(args[0])); } +static __maybe_unused const int syscall_arches[] = { + AUDIT_ARCH_OPENRISC +}; + static inline int syscall_get_arch(struct task_struct *task) { return AUDIT_ARCH_OPENRISC; diff --git a/arch/parisc/include/asm/syscall.h b/arch/parisc/include/asm/syscall.h index 00b127a5e09b..2915f140c9fd 100644 --- a/arch/parisc/include/asm/syscall.h +++ b/arch/parisc/include/asm/syscall.h @@ -55,6 +55,13 @@ static inline void syscall_rollback(struct task_struct *task, /* do nothing */ } +static __maybe_unused const int syscall_arches[] = { + AUDIT_ARCH_PARISC, +#ifdef CONFIG_64BIT + AUDIT_ARCH_PARISC64, +#endif +}; + static inline int syscall_get_arch(struct task_struct *task) { int arch = AUDIT_ARCH_PARISC; diff --git a/arch/powerpc/include/asm/syscall.h b/arch/powerpc/include/asm/syscall.h index fd1b518eed17..781deb211e3d 100644 --- a/arch/powerpc/include/asm/syscall.h +++ b/arch/powerpc/include/asm/syscall.h @@ -104,6 +104,20 @@ static inline void syscall_set_arguments(struct task_struct *task, regs->orig_gpr3 = args[0]; } +static __maybe_unused const int syscall_arches[] = { 
+#ifdef __LITTLE_ENDIAN__ + AUDIT_ARCH_PPC | __AUDIT_ARCH_LE, +# ifdef CONFIG_PPC64 + AUDIT_ARCH_PPC64LE, +# endif /* CONFIG_PPC64 */ +#else + AUDIT_ARCH_PPC, +# ifdef CONFIG_PPC64 + AUDIT_ARCH_PPC64, +# endif /* CONFIG_PPC64 */ +#endif /* __LITTLE_ENDIAN__ */ +}; + static inline int syscall_get_arch(struct task_struct *task) { int arch; diff --git a/arch/riscv/include/asm/syscall.h b/arch/riscv/include/asm/syscall.h index 49350c8bd7b0..4b36d358243e 100644 --- a/arch/riscv/include/asm/syscall.h +++ b/arch/riscv/include/asm/syscall.h @@ -73,13 +73,19 @@ static inline void syscall_set_arguments(struct task_struct *task, memcpy(®s->a1, args, 5 * sizeof(regs->a1)); } -static inline int syscall_get_arch(struct task_struct *task) -{ #ifdef CONFIG_64BIT - return AUDIT_ARCH_RISCV64; +#define SYSCALL_ARCH AUDIT_ARCH_RISCV64 #else - return AUDIT_ARCH_RISCV32; +#define SYSCALL_ARCH AUDIT_ARCH_RISCV32 #endif + +static __maybe_unused const int syscall_arches[] = { + SYSCALL_ARCH +}; + +static inline int syscall_get_arch(struct task_struct *task) +{ + return SYSCALL_ARCH; } #endif /* _ASM_RISCV_SYSCALL_H */ diff --git a/arch/s390/include/asm/syscall.h b/arch/s390/include/asm/syscall.h index d9d5de0f67ff..4cb9da36610a 100644 --- a/arch/s390/include/asm/syscall.h +++ b/arch/s390/include/asm/syscall.h @@ -89,6 +89,13 @@ static inline void syscall_set_arguments(struct task_struct *task, regs->orig_gpr2 = args[0]; } +static __maybe_unused const int syscall_arches[] = { + AUDIT_ARCH_S390X, +#ifdef CONFIG_COMPAT + AUDIT_ARCH_S390, +#endif +}; + static inline int syscall_get_arch(struct task_struct *task) { #ifdef CONFIG_COMPAT diff --git a/arch/sh/include/asm/syscall_32.h b/arch/sh/include/asm/syscall_32.h index cb51a7528384..4780f2339c72 100644 --- a/arch/sh/include/asm/syscall_32.h +++ b/arch/sh/include/asm/syscall_32.h @@ -69,13 +69,18 @@ static inline void syscall_set_arguments(struct task_struct *task, regs->regs[4] = args[0]; } -static inline int syscall_get_arch(struct 
task_struct *task) -{ - int arch = AUDIT_ARCH_SH; - #ifdef CONFIG_CPU_LITTLE_ENDIAN - arch |= __AUDIT_ARCH_LE; +#define SYSCALL_ARCH AUDIT_ARCH_SHEL +#else +#define SYSCALL_ARCH AUDIT_ARCH_SH #endif - return arch; + +static __maybe_unused const int syscall_arches[] = { + SYSCALL_ARCH +}; + +static inline int syscall_get_arch(struct task_struct *task) +{ + return SYSCALL_ARCH; } #endif /* __ASM_SH_SYSCALL_32_H */ diff --git a/arch/sparc/include/asm/syscall.h b/arch/sparc/include/asm/syscall.h index 62a5a78804c4..a458992cdcfe 100644 --- a/arch/sparc/include/asm/syscall.h +++ b/arch/sparc/include/asm/syscall.h @@ -127,6 +127,15 @@ static inline void syscall_set_arguments(struct task_struct *task, regs->u_regs[UREG_I0 + i] = args[i]; } +static __maybe_unused const int syscall_arches[] = { +#ifdef CONFIG_SPARC64 + AUDIT_ARCH_SPARC64, +#endif +#if !defined(CONFIG_SPARC64) || defined(CONFIG_COMPAT) + AUDIT_ARCH_SPARC, +#endif +}; + static inline int syscall_get_arch(struct task_struct *task) { #if defined(CONFIG_SPARC64) && defined(CONFIG_COMPAT) diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h index 7cbf733d11af..e13bb2a65b6f 100644 --- a/arch/x86/include/asm/syscall.h +++ b/arch/x86/include/asm/syscall.h @@ -97,6 +97,10 @@ static inline void syscall_set_arguments(struct task_struct *task, memcpy(®s->bx + i, args, n * sizeof(args[0])); } +static __maybe_unused const int syscall_arches[] = { + AUDIT_ARCH_I386 +}; + static inline int syscall_get_arch(struct task_struct *task) { return AUDIT_ARCH_I386; @@ -152,6 +156,13 @@ static inline void syscall_set_arguments(struct task_struct *task, } } +static __maybe_unused const int syscall_arches[] = { + AUDIT_ARCH_X86_64, +#ifdef CONFIG_IA32_EMULATION + AUDIT_ARCH_I386, +#endif +}; + static inline int syscall_get_arch(struct task_struct *task) { /* x32 tasks should be considered AUDIT_ARCH_X86_64. 
*/ diff --git a/arch/x86/um/asm/syscall.h b/arch/x86/um/asm/syscall.h index 56a2f0913e3c..590a31e22b99 100644 --- a/arch/x86/um/asm/syscall.h +++ b/arch/x86/um/asm/syscall.h @@ -9,13 +9,19 @@ typedef asmlinkage long (*sys_call_ptr_t)(unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long); -static inline int syscall_get_arch(struct task_struct *task) -{ #ifdef CONFIG_X86_32 - return AUDIT_ARCH_I386; +#define SYSCALL_ARCH AUDIT_ARCH_I386 #else - return AUDIT_ARCH_X86_64; +#define SYSCALL_ARCH AUDIT_ARCH_X86_64 #endif + +static __maybe_unused const int syscall_arches[] = { + SYSCALL_ARCH +}; + +static inline int syscall_get_arch(struct task_struct *task) +{ + return SYSCALL_ARCH; } #endif /* __UM_ASM_SYSCALL_H */ diff --git a/arch/xtensa/include/asm/syscall.h b/arch/xtensa/include/asm/syscall.h index f9a671cbf933..3d334fb0d329 100644 --- a/arch/xtensa/include/asm/syscall.h +++ b/arch/xtensa/include/asm/syscall.h @@ -14,6 +14,10 @@ #include <asm/ptrace.h> #include <uapi/linux/audit.h> +static __maybe_unused const int syscall_arches[] = { + AUDIT_ARCH_XTENSA +}; + static inline int syscall_get_arch(struct task_struct *task) { return AUDIT_ARCH_XTENSA; -- 2.28.0 ^ permalink raw reply related [flat|nested] 149+ messages in thread
* RE: [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array 2020-09-24 12:44 ` [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array YiFei Zhu @ 2020-09-24 13:47 ` David Laight 2020-09-24 14:16 ` YiFei Zhu 0 siblings, 1 reply; 149+ messages in thread From: David Laight @ 2020-09-24 13:47 UTC (permalink / raw) To: 'YiFei Zhu', containers Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: YiFei Zhu > Sent: 24 September 2020 13:44 > > Seccomp cache emulator needs to know all the architecture numbers > that syscall_get_arch() could return for the kernel build in order > to generate a cache for all of them. > > The array is declared in header as static __maybe_unused const > to maximize compiler optimiation opportunities such as loop > unrolling. I doubt the compiler will do what you want. Looking at it, in most cases there are one or two entries. I think only MIPS has three. So a static inline function that contains a list of conditionals will generate better code than any kind of array lookup. For x86-64 you end up with something like: #ifdef CONFIG_IA32_EMULATION if (sd->arch == AUDIT_ARCH_I386) return xxx; #endif return yyy; Probably saves you having multiple arrays that need to be kept carefully in step. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales) ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array 2020-09-24 13:47 ` David Laight @ 2020-09-24 14:16 ` YiFei Zhu 2020-09-24 14:20 ` David Laight 0 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-09-24 14:16 UTC (permalink / raw) To: David Laight Cc: containers, YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Thu, Sep 24, 2020 at 8:47 AM David Laight <David.Laight@aculab.com> wrote: > I doubt the compiler will do what you want. > Looking at it, in most cases there are one or two entries. > I think only MIPS has three. It does ;) GCC 10.2.0: $ objdump -d kernel/seccomp.o | less [...] 0000000000001520 <__seccomp_filter>: [...] 1587: 41 8b 54 24 04 mov 0x4(%r12),%edx 158c: b9 08 01 00 00 mov $0x108,%ecx 1591: 81 fa 3e 00 00 c0 cmp $0xc000003e,%edx 1597: 75 2e jne 15c7 <__seccomp_filter+0xa7> [...] 15c7: 81 fa 03 00 00 40 cmp $0x40000003,%edx 15cd: b9 40 01 00 00 mov $0x140,%ecx 15d2: 74 c5 je 1599 <__seccomp_filter+0x79> 15d4: 0f 0b ud2 [...] 0000000000001cb0 <seccomp_cache_prepare>: [...] 1cc4: 41 b9 3e 00 00 c0 mov $0xc000003e,%r9d [...] 1dba: 41 b9 03 00 00 40 mov $0x40000003,%r9d [...] 0000000000002e30 <proc_pid_seccomp_cache>: [...] 2e72: ba 3e 00 00 c0 mov $0xc000003e,%edx [...] 2eb5: ba 03 00 00 40 mov $0x40000003,%edx Granted, I have CC_OPTIMIZE_FOR_PERFORMANCE rather than CC_OPTIMIZE_FOR_SIZE, but this patch itself is trying to sacrifice some of the memory for speed. YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
* RE: [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array 2020-09-24 14:16 ` YiFei Zhu @ 2020-09-24 14:20 ` David Laight 2020-09-24 14:37 ` YiFei Zhu 0 siblings, 1 reply; 149+ messages in thread From: David Laight @ 2020-09-24 14:20 UTC (permalink / raw) To: 'YiFei Zhu' Cc: containers, YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: YiFei Zhu > Sent: 24 September 2020 15:17 > > On Thu, Sep 24, 2020 at 8:47 AM David Laight <David.Laight@aculab.com> wrote: > > I doubt the compiler will do what you want. > > Looking at it, in most cases there are one or two entries. > > I think only MIPS has three. > > It does ;) GCC 10.2.0: > > $ objdump -d kernel/seccomp.o | less > [...] > 0000000000001520 <__seccomp_filter>: > [...] > 1587: 41 8b 54 24 04 mov 0x4(%r12),%edx > 158c: b9 08 01 00 00 mov $0x108,%ecx > 1591: 81 fa 3e 00 00 c0 cmp $0xc000003e,%edx > 1597: 75 2e jne 15c7 <__seccomp_filter+0xa7> > [...] > 15c7: 81 fa 03 00 00 40 cmp $0x40000003,%edx > 15cd: b9 40 01 00 00 mov $0x140,%ecx > 15d2: 74 c5 je 1599 <__seccomp_filter+0x79> > 15d4: 0f 0b ud2 > [...] > 0000000000001cb0 <seccomp_cache_prepare>: > [...] > 1cc4: 41 b9 3e 00 00 c0 mov $0xc000003e,%r9d > [...] > 1dba: 41 b9 03 00 00 40 mov $0x40000003,%r9d > [...] > 0000000000002e30 <proc_pid_seccomp_cache>: > [...] > 2e72: ba 3e 00 00 c0 mov $0xc000003e,%edx > [...] > 2eb5: ba 03 00 00 40 mov $0x40000003,%edx > > Granted, I have CC_OPTIMIZE_FOR_PERFORMANCE rather than > CC_OPTIMIZE_FOR_SIZE, but this patch itself is trying to sacrifice > some of the memory for speed. Don't both CC_OPTIMIZE_FOR_PERFORMANCE (-??) and CC_OPTIMIZE_FOR_SIZE (-s) generate terrible code? Try with a slightly older gcc. I think that entire optimisation (discarding const arrays) is very recent. 
David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales) ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array 2020-09-24 14:20 ` David Laight @ 2020-09-24 14:37 ` YiFei Zhu 2020-09-24 16:02 ` YiFei Zhu 0 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-09-24 14:37 UTC (permalink / raw) To: David Laight Cc: containers, YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Thu, Sep 24, 2020 at 9:20 AM David Laight <David.Laight@aculab.com> wrote: > > Granted, I have CC_OPTIMIZE_FOR_PERFORMANCE rather than > > CC_OPTIMIZE_FOR_SIZE, but this patch itself is trying to sacrifice > > some of the memory for speed. > > Don't both CC_OPTIMIZE_FOR_PERFORMANCE (-??) and CC_OPTIMIZE_FOR_SIZE (-s) > generate terrible code? You have to choose one for "Compiler optimization level" in "General Setup", no? The former is -O2 and the latter is -Os. > Try with a slghtly older gcc. > I think that entire optimisation (discarding const arrays) > is very recent. Will try, will take a while to get an old GCC to run, however :/ YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array 2020-09-24 14:37 ` YiFei Zhu @ 2020-09-24 16:02 ` YiFei Zhu 0 siblings, 0 replies; 149+ messages in thread From: YiFei Zhu @ 2020-09-24 16:02 UTC (permalink / raw) To: David Laight Cc: containers, YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Thu, Sep 24, 2020 at 9:37 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > > Try with a slghtly older gcc. > > I think that entire optimisation (discarding const arrays) > > is very recent. > > Will try, will take a while to get an old GCC to run, however :/ Possibly one of the oldest I can easily get to work is GCC 6.5.0, and unrolling seems to still be the case: 0000000000001560 <__seccomp_filter>: [...] 15d4: 41 8b 74 24 04 mov 0x4(%r12),%esi 15d9: bf 08 01 00 00 mov $0x108,%edi 15de: 81 fe 3e 00 00 c0 cmp $0xc000003e,%esi 15e4: 75 30 jne 1616 <__seccomp_filter+0xb6> [...] 1616: 81 fe 03 00 00 40 cmp $0x40000003,%esi 161c: bf 40 01 00 00 mov $0x140,%edi 1621: 74 c3 je 15e6 <__seccomp_filter+0x86> 1623: 0f 0b ud2 Am I overlooking something or should I go further back in the compiler version? YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
* [PATCH v2 seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent 2020-09-24 12:44 ` [PATCH v2 seccomp 0/6] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls YiFei Zhu 2020-09-24 12:44 ` [PATCH v2 seccomp 1/6] seccomp: Move config option SECCOMP to arch/Kconfig YiFei Zhu 2020-09-24 12:44 ` [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array YiFei Zhu @ 2020-09-24 12:44 ` YiFei Zhu 2020-09-24 23:25 ` Kees Cook 2020-09-24 12:44 ` [PATCH v2 seccomp 4/6] seccomp/cache: Lookup syscall allowlist for fast path YiFei Zhu ` (2 subsequent siblings) 5 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-09-24 12:44 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: YiFei Zhu <yifeifz2@illinois.edu> SECCOMP_CACHE_NR_ONLY will only operate on syscalls that do not access any syscall arguments or instruction pointer. To facilitate this we need a static analyser to know whether a filter will return allow regardless of syscall arguments for a given architecture number / syscall number pair. This is implemented here with a pseudo-emulator, and stored in a per-filter bitmap. Each common BPF instruction (stolen from Kees's list [1]) is emulated. Any weirdness or loading from a syscall argument will cause the emulator to bail. The emulation is also halted if it reaches a return. In that case, if it returns SECCOMP_RET_ALLOW, the syscall is marked as good. Filter dependency is resolved at attach time. If a filter depends on more filters, then we perform an AND on its bitmask against its dependee's; if the dependee does not guarantee to allow the syscall, then the depender is also marked not to guarantee to allow the syscall. 
[1] https://lore.kernel.org/lkml/20200923232923.3142503-5-keescook@chromium.org/ Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> --- arch/Kconfig | 25 ++++++ kernel/seccomp.c | 194 ++++++++++++++++++++++++++++++++++++++++++++++- 2 files changed, 218 insertions(+), 1 deletion(-) diff --git a/arch/Kconfig b/arch/Kconfig index 6dfc5673215d..8cc3dc87f253 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -489,6 +489,31 @@ config SECCOMP_FILTER See Documentation/userspace-api/seccomp_filter.rst for details. +choice + prompt "Seccomp filter cache" + default SECCOMP_CACHE_NONE + depends on SECCOMP_FILTER + help + Seccomp filters can potentially incur large overhead for each + system call. This can alleviate some of the overhead. + + If in doubt, select 'syscall numbers only'. + +config SECCOMP_CACHE_NONE + bool "None" + help + No caching is done. Seccomp filters will be called each time + a system call occurs in a seccomp-guarded task. + +config SECCOMP_CACHE_NR_ONLY + bool "Syscall number only" + depends on !HAVE_SPARSE_SYSCALL_NR + help + For each syscall number, if the seccomp filter has a fixed + result, store that result in a bitmap to speed up system calls. + +endchoice + config HAVE_ARCH_STACKLEAK bool help diff --git a/kernel/seccomp.c b/kernel/seccomp.c index 3ee59ce0a323..20d33378a092 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -143,6 +143,32 @@ struct notification { struct list_head notifications; }; +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY +/** + * struct seccomp_cache_filter_data - container for cache's per-filter data + * + * @syscall_ok: A bitmap for each architecture number, where each bit + * represents whether the filter will always allow the syscall. 
+ */ +struct seccomp_cache_filter_data { + DECLARE_BITMAP(syscall_ok[ARRAY_SIZE(syscall_arches)], NR_syscalls); +}; + +#define SECCOMP_EMU_MAX_PENDING_STATES 64 +#else +struct seccomp_cache_filter_data { }; + +static inline int seccomp_cache_prepare(struct seccomp_filter *sfilter) +{ + return 0; +} + +static inline void seccomp_cache_inherit(struct seccomp_filter *sfilter, + const struct seccomp_filter *prev) +{ +} +#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */ + /** * struct seccomp_filter - container for seccomp BPF programs * @@ -185,6 +211,7 @@ struct seccomp_filter { struct notification *notif; struct mutex notify_lock; wait_queue_head_t wqh; + struct seccomp_cache_filter_data cache; }; /* Limit any path through the tree to 256KB worth of instructions. */ @@ -530,6 +557,139 @@ static inline void seccomp_sync_threads(unsigned long flags) } } +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY +/** + * struct seccomp_emu_env - container for seccomp emulator environment + * + * @filter: The cBPF filter instructions. + * @nr: The syscall number we are emulating. + * @arch: The architecture number we are emulating. + * @syscall_ok: Emulation result, whether it is okay for seccomp to cache the + * syscall. + */ +struct seccomp_emu_env { + struct sock_filter *filter; + int arch; + int nr; + bool syscall_ok; +}; + +/** + * struct seccomp_emu_state - container for seccomp emulator state + * + * @next: The next pending state. This structure is a linked list. + * @pc: The current program counter. + * @areg: the value of that A register. + */ +struct seccomp_emu_state { + struct seccomp_emu_state *next; + int pc; + u32 areg; +}; + +/** + * seccomp_emu_step - step one instruction in the emulator + * @env: The emulator environment + * @state: The emulator state + * + * Returns 1 to halt emulation, 0 to continue, or -errno if error occurred. 
+ */ +static int seccomp_emu_step(struct seccomp_emu_env *env, + struct seccomp_emu_state *state) +{ + struct sock_filter *ftest = &env->filter[state->pc++]; + u16 code = ftest->code; + u32 k = ftest->k; + bool compare; + + switch (code) { + case BPF_LD | BPF_W | BPF_ABS: + if (k == offsetof(struct seccomp_data, nr)) + state->areg = env->nr; + else if (k == offsetof(struct seccomp_data, arch)) + state->areg = env->arch; + else + return 1; + + return 0; + case BPF_JMP | BPF_JA: + state->pc += k; + return 0; + case BPF_JMP | BPF_JEQ | BPF_K: + case BPF_JMP | BPF_JGE | BPF_K: + case BPF_JMP | BPF_JGT | BPF_K: + case BPF_JMP | BPF_JSET | BPF_K: + switch (BPF_OP(code)) { + case BPF_JEQ: + compare = state->areg == k; + break; + case BPF_JGT: + compare = state->areg > k; + break; + case BPF_JGE: + compare = state->areg >= k; + break; + case BPF_JSET: + compare = state->areg & k; + break; + default: + WARN_ON(true); + return -EINVAL; + } + + state->pc += compare ? ftest->jt : ftest->jf; + return 0; + case BPF_ALU | BPF_AND | BPF_K: + state->areg &= k; + return 0; + case BPF_RET | BPF_K: + env->syscall_ok = k == SECCOMP_RET_ALLOW; + return 1; + default: + return 1; + } +} + +/** + * seccomp_cache_prepare - emulate the filter to find cachable syscalls + * @sfilter: The seccomp filter + * + * Returns 0 if successful or -errno if error occurred. 
+ */ +int seccomp_cache_prepare(struct seccomp_filter *sfilter) +{ + struct sock_fprog_kern *fprog = sfilter->prog->orig_prog; + struct sock_filter *filter = fprog->filter; + int arch, nr, res = 0; + + for (arch = 0; arch < ARRAY_SIZE(syscall_arches); arch++) { + for (nr = 0; nr < NR_syscalls; nr++) { + struct seccomp_emu_env env = {0}; + struct seccomp_emu_state state = {0}; + + env.filter = filter; + env.arch = syscall_arches[arch]; + env.nr = nr; + + while (true) { + res = seccomp_emu_step(&env, &state); + if (res) + break; + } + + if (res < 0) + goto out; + + if (env.syscall_ok) + set_bit(nr, sfilter->cache.syscall_ok[arch]); + } + } + +out: + return res; +} +#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */ + /** * seccomp_prepare_filter: Prepares a seccomp filter for use. * @fprog: BPF program to install @@ -540,7 +700,8 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog) { struct seccomp_filter *sfilter; int ret; - const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE); + const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE) || + IS_ENABLED(CONFIG_SECCOMP_CACHE_NR_ONLY); if (fprog->len == 0 || fprog->len > BPF_MAXINSNS) return ERR_PTR(-EINVAL); @@ -571,6 +732,13 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog) return ERR_PTR(ret); } + ret = seccomp_cache_prepare(sfilter); + if (ret < 0) { + bpf_prog_destroy(sfilter->prog); + kfree(sfilter); + return ERR_PTR(ret); + } + refcount_set(&sfilter->refs, 1); refcount_set(&sfilter->users, 1); init_waitqueue_head(&sfilter->wqh); @@ -606,6 +774,29 @@ seccomp_prepare_user_filter(const char __user *user_filter) return filter; } +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY +/** + * seccomp_cache_inherit - mask accept bitmap against previous filter + * @sfilter: The seccomp filter + * @sfilter: The previous seccomp filter + */ +static void seccomp_cache_inherit(struct seccomp_filter *sfilter, + const struct seccomp_filter *prev) +{ + int arch; + + if (!prev) + 
+		return;
+
+	for (arch = 0; arch < ARRAY_SIZE(syscall_arches); arch++) {
+		bitmap_and(sfilter->cache.syscall_ok[arch],
+			   sfilter->cache.syscall_ok[arch],
+			   prev->cache.syscall_ok[arch],
+			   NR_syscalls);
+	}
+}
+#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */
+
 /**
  * seccomp_attach_filter: validate and attach filter
  * @flags: flags to change filter behavior
@@ -655,6 +846,7 @@ static long seccomp_attach_filter(unsigned int flags,
 	 * task reference.
 	 */
 	filter->prev = current->seccomp.filter;
+	seccomp_cache_inherit(filter, filter->prev);
 	current->seccomp.filter = filter;
 	atomic_inc(&current->seccomp.filter_count);

-- 
2.28.0
* Re: [PATCH v2 seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent 2020-09-24 12:44 ` [PATCH v2 seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent YiFei Zhu @ 2020-09-24 23:25 ` Kees Cook 2020-09-25 3:04 ` YiFei Zhu 0 siblings, 1 reply; 149+ messages in thread From: Kees Cook @ 2020-09-24 23:25 UTC (permalink / raw) To: YiFei Zhu Cc: containers, YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Thu, Sep 24, 2020 at 07:44:18AM -0500, YiFei Zhu wrote: > From: YiFei Zhu <yifeifz2@illinois.edu> > > SECCOMP_CACHE_NR_ONLY will only operate on syscalls that do not > access any syscall arguments or instruction pointer. To facilitate > this we need a static analyser to know whether a filter will > return allow regardless of syscall arguments for a given > architecture number / syscall number pair. This is implemented > here with a pseudo-emulator, and stored in a per-filter bitmap. > > Each common BPF instruction (stolen from Kees's list [1]) are > emulated. Any weirdness or loading from a syscall argument will > cause the emulator to bail. > > The emulation is also halted if it reaches a return. In that case, > if it returns an SECCOMP_RET_ALLOW, the syscall is marked as good. > > Filter dependency is resolved at attach time. If a filter depends > on more filters, then we perform an and on its bitmask against its > dependee; if the dependee does not guarantee to allow the syscall, > then the depender is also marked not to guarantee to allow the > syscall. 
> > [1] https://lore.kernel.org/lkml/20200923232923.3142503-5-keescook@chromium.org/ > > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> > --- > arch/Kconfig | 25 ++++++ > kernel/seccomp.c | 194 ++++++++++++++++++++++++++++++++++++++++++++++- > 2 files changed, 218 insertions(+), 1 deletion(-) > > diff --git a/arch/Kconfig b/arch/Kconfig > index 6dfc5673215d..8cc3dc87f253 100644 > --- a/arch/Kconfig > +++ b/arch/Kconfig > @@ -489,6 +489,31 @@ config SECCOMP_FILTER > > See Documentation/userspace-api/seccomp_filter.rst for details. > > +choice > + prompt "Seccomp filter cache" > + default SECCOMP_CACHE_NONE > + depends on SECCOMP_FILTER > + help > + Seccomp filters can potentially incur large overhead for each > + system call. This can alleviate some of the overhead. > + > + If in doubt, select 'syscall numbers only'. > + > +config SECCOMP_CACHE_NONE > + bool "None" > + help > + No caching is done. Seccomp filters will be called each time > + a system call occurs in a seccomp-guarded task. > + > +config SECCOMP_CACHE_NR_ONLY > + bool "Syscall number only" > + depends on !HAVE_SPARSE_SYSCALL_NR > + help > + For each syscall number, if the seccomp filter has a fixed > + result, store that result in a bitmap to speed up system calls. > + > +endchoice I'm not interested in seccomp having a config option for this. It should entire exist or not, and that depends on the per-architecture support. You mentioned in another thread that you wanted it to let people play with this support in some way. Can you elaborate on this? My perspective is that of distro and vendor kernels: there is _one_ config and end users can't really do anything about it without rolling their own kernels. > +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY > +/** > + * struct seccomp_cache_filter_data - container for cache's per-filter data > + * > + * @syscall_ok: A bitmap for each architecture number, where each bit > + * represents whether the filter will always allow the syscall. 
> + */ > +struct seccomp_cache_filter_data { > + DECLARE_BITMAP(syscall_ok[ARRAY_SIZE(syscall_arches)], NR_syscalls); > +}; So, as Jann pointed out, using NR_syscalls only accidentally works -- they're actually different sizes and there isn't strictly any reason to expect one to be smaller than another. So, we need to either choose the max() in asm/linux/seccomp.h or be more efficient with space usage and use explicitly named bitmaps (how my v1 does things). > + > +#define SECCOMP_EMU_MAX_PENDING_STATES 64 This isn't used in this patch; likely leftover/in need of moving? > +#else > +struct seccomp_cache_filter_data { }; > + > +static inline int seccomp_cache_prepare(struct seccomp_filter *sfilter) > +{ > + return 0; > +} > + > +static inline void seccomp_cache_inherit(struct seccomp_filter *sfilter, > + const struct seccomp_filter *prev) > +{ > +} > +#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */ > + > /** > * struct seccomp_filter - container for seccomp BPF programs > * > @@ -185,6 +211,7 @@ struct seccomp_filter { > struct notification *notif; > struct mutex notify_lock; > wait_queue_head_t wqh; > + struct seccomp_cache_filter_data cache; I moved this up in the structure to see if I could benefit from cache line sharing. In either case, we must verify (with "pahole") that we do not induce massive padding in the struct. But yes, attaching this to the filter is the right way to go. > }; > > /* Limit any path through the tree to 256KB worth of instructions. */ > @@ -530,6 +557,139 @@ static inline void seccomp_sync_threads(unsigned long flags) > } > } > > +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY > +/** > + * struct seccomp_emu_env - container for seccomp emulator environment > + * > + * @filter: The cBPF filter instructions. > + * @nr: The syscall number we are emulating. > + * @arch: The architecture number we are emulating. > + * @syscall_ok: Emulation result, whether it is okay for seccomp to cache the > + * syscall. 
> + */ > +struct seccomp_emu_env { > + struct sock_filter *filter; > + int arch; > + int nr; > + bool syscall_ok; nit: "ok" is too vague. We mean either "constant action" or "allow" (or "filter" in the negative case). > +}; > + > +/** > + * struct seccomp_emu_state - container for seccomp emulator state > + * > + * @next: The next pending state. This structure is a linked list. > + * @pc: The current program counter. > + * @areg: the value of that A register. > + */ > +struct seccomp_emu_state { > + struct seccomp_emu_state *next; > + int pc; > + u32 areg; > +}; Why is this split out? (i.e. why is it not just a self-contained loop the way Jann wrote it?) > + > +/** > + * seccomp_emu_step - step one instruction in the emulator > + * @env: The emulator environment > + * @state: The emulator state > + * > + * Returns 1 to halt emulation, 0 to continue, or -errno if error occurred. I appreciate the -errno intent, but it actually risks making these changes break existing userspace filters: if something is unhandled in the emulator in a way we don't find during design and testing, the filter load will actually _fail_ instead of just falling back to "run filter". Failures should be reported (WARN_ON_ONCE()), but my v1 intentionally lets this continue. 
> + */ > +static int seccomp_emu_step(struct seccomp_emu_env *env, > + struct seccomp_emu_state *state) > +{ > + struct sock_filter *ftest = &env->filter[state->pc++]; > + u16 code = ftest->code; > + u32 k = ftest->k; > + bool compare; > + > + switch (code) { > + case BPF_LD | BPF_W | BPF_ABS: > + if (k == offsetof(struct seccomp_data, nr)) > + state->areg = env->nr; > + else if (k == offsetof(struct seccomp_data, arch)) > + state->areg = env->arch; > + else > + return 1; > + > + return 0; > + case BPF_JMP | BPF_JA: > + state->pc += k; > + return 0; > + case BPF_JMP | BPF_JEQ | BPF_K: > + case BPF_JMP | BPF_JGE | BPF_K: > + case BPF_JMP | BPF_JGT | BPF_K: > + case BPF_JMP | BPF_JSET | BPF_K: > + switch (BPF_OP(code)) { > + case BPF_JEQ: > + compare = state->areg == k; > + break; > + case BPF_JGT: > + compare = state->areg > k; > + break; > + case BPF_JGE: > + compare = state->areg >= k; > + break; > + case BPF_JSET: > + compare = state->areg & k; > + break; > + default: > + WARN_ON(true); > + return -EINVAL; > + } > + > + state->pc += compare ? ftest->jt : ftest->jf; > + return 0; > + case BPF_ALU | BPF_AND | BPF_K: > + state->areg &= k; > + return 0; > + case BPF_RET | BPF_K: > + env->syscall_ok = k == SECCOMP_RET_ALLOW; > + return 1; > + default: > + return 1; > + } > +} This version appears to have removed all the comments; I liked Jann's comments and I had rearranged things a bit to make it more readable (IMO) for people that do not immediate understand BPF. :) > + > +/** > + * seccomp_cache_prepare - emulate the filter to find cachable syscalls > + * @sfilter: The seccomp filter > + * > + * Returns 0 if successful or -errno if error occurred. 
> + */ > +int seccomp_cache_prepare(struct seccomp_filter *sfilter) > +{ > + struct sock_fprog_kern *fprog = sfilter->prog->orig_prog; > + struct sock_filter *filter = fprog->filter; > + int arch, nr, res = 0; > + > + for (arch = 0; arch < ARRAY_SIZE(syscall_arches); arch++) { > + for (nr = 0; nr < NR_syscalls; nr++) { > + struct seccomp_emu_env env = {0}; > + struct seccomp_emu_state state = {0}; > + > + env.filter = filter; > + env.arch = syscall_arches[arch]; > + env.nr = nr; > + > + while (true) { > + res = seccomp_emu_step(&env, &state); > + if (res) > + break; > + } > + > + if (res < 0) > + goto out; > + > + if (env.syscall_ok) > + set_bit(nr, sfilter->cache.syscall_ok[arch]); I don't really like the complexity here, passing around syscall_ok, etc. I feel like seccomp_emu_step() should be self-contained to say "allow or filter" directly. I also prefer an inversion to the logic: if we start bitmaps as "default allow", we only ever increase the filtering cases: we can never accidentally ADD an allow to the bitmap. (This was an intentional design in the RFC and v1 to do as much as possible to fail safe.) > + } > + } > + > +out: > + return res; > +} > +#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */ > + > /** > * seccomp_prepare_filter: Prepares a seccomp filter for use. 
> * @fprog: BPF program to install > @@ -540,7 +700,8 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog) > { > struct seccomp_filter *sfilter; > int ret; > - const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE); > + const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE) || > + IS_ENABLED(CONFIG_SECCOMP_CACHE_NR_ONLY); > > if (fprog->len == 0 || fprog->len > BPF_MAXINSNS) > return ERR_PTR(-EINVAL); > @@ -571,6 +732,13 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog) > return ERR_PTR(ret); > } > > + ret = seccomp_cache_prepare(sfilter); > + if (ret < 0) { > + bpf_prog_destroy(sfilter->prog); > + kfree(sfilter); > + return ERR_PTR(ret); > + } Why do the prepare here instead of during attach? (And note that it should not be written to fail.) > + > refcount_set(&sfilter->refs, 1); > refcount_set(&sfilter->users, 1); > init_waitqueue_head(&sfilter->wqh); > @@ -606,6 +774,29 @@ seccomp_prepare_user_filter(const char __user *user_filter) > return filter; > } > > +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY > +/** > + * seccomp_cache_inherit - mask accept bitmap against previous filter > + * @sfilter: The seccomp filter > + * @sfilter: The previous seccomp filter > + */ > +static void seccomp_cache_inherit(struct seccomp_filter *sfilter, > + const struct seccomp_filter *prev) > +{ > + int arch; > + > + if (!prev) > + return; > + > + for (arch = 0; arch < ARRAY_SIZE(syscall_arches); arch++) { > + bitmap_and(sfilter->cache.syscall_ok[arch], > + sfilter->cache.syscall_ok[arch], > + prev->cache.syscall_ok[arch], > + NR_syscalls); > + } And, as per being as defensive as I can imagine, this should be a one-way mask: we can only remove bits from syscall_ok, never add them. sfilter must be constructed so that it can only ever have fewer or the same bits set as prev. 
> +}
> +#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */
> +
>  /**
>   * seccomp_attach_filter: validate and attach filter
>   * @flags: flags to change filter behavior
> @@ -655,6 +846,7 @@ static long seccomp_attach_filter(unsigned int flags,
>  	 * task reference.
>  	 */
>  	filter->prev = current->seccomp.filter;
> +	seccomp_cache_inherit(filter, filter->prev);

In the RFC I did this inherit earlier (in the emulation stage) to
benefit from the RET_KILL results, but that's not very useful any more.
However, I think it's still code-locality better to keep the bit
manipulation logic as close together as possible for readability.

>  	current->seccomp.filter = filter;
>  	atomic_inc(&current->seccomp.filter_count);
> 
> -- 
> 2.28.0
> 

-- 
Kees Cook
* Re: [PATCH v2 seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent 2020-09-24 23:25 ` Kees Cook @ 2020-09-25 3:04 ` YiFei Zhu 2020-09-25 16:45 ` YiFei Zhu 0 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-09-25 3:04 UTC (permalink / raw) To: Kees Cook Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry [resending this, forgot to hit reply all...] On Thu, Sep 24, 2020 at 6:25 PM Kees Cook <keescook@chromium.org> wrote: > I'm not interested in seccomp having a config option for this. It should > entire exist or not, and that depends on the per-architecture support. > You mentioned in another thread that you wanted it to let people play > with this support in some way. Can you elaborate on this? My perspective > is that of distro and vendor kernels: there is _one_ config and end > users can't really do anything about it without rolling their own > kernels. That's one. The other is to allow future optional extensions, like syscall-argument-capable accelerators. Distro / vendor kernels will keep defaults anyways, no? > So, as Jann pointed out, using NR_syscalls only accidentally works -- > they're actually different sizes and there isn't strictly any reason to > expect one to be smaller than another. So, we need to either choose the > max() in asm/linux/seccomp.h or be more efficient with space usage and > use explicitly named bitmaps (how my v1 does things). Right. > This isn't used in this patch; likely leftover/in need of moving? Correct. Will remove. > I moved this up in the structure to see if I could benefit from cache > line sharing. In either case, we must verify (with "pahole") that we do > not induce massive padding in the struct. 
> 
> But yes, attaching this to the filter is the right way to go.

Right. I don't think it would cause massive padding with all I know about
padding learnt from [1]. I'm used to using gdb to look at structure layout,
and this is what I see:

(gdb) ptype /o struct seccomp_filter
/* offset    |  size */  type = struct seccomp_filter {
/*    0      |     4 */    refcount_t refs;
/*    4      |     4 */    refcount_t users;
/*    8      |     1 */    bool log;
/* XXX  7-byte hole  */
/*   16      |     8 */    struct seccomp_filter *prev;
[...]
/*  264      |   112 */    struct seccomp_cache_filter_data {
/*  264      |   112 */      unsigned long syscall_ok[2][7];
                             /* total size (bytes):  112 */
                           } cache;
                           /* total size (bytes):  376 */
                         }

The bitmaps are long-aligned; so is the prev pointer. If we want we can put
the cache struct right before prev and that should not introduce any new
holes. It's the refcounts and the bool that are not cooperative.

> nit: "ok" is too vague. We mean either "constant action" or "allow" (or
> "filter" in the negative case).

Right.

> Why is this split out? (i.e. why is it not just a self-contained loop
> the way Jann wrote it?)

Because my brain thinks like a finite state machine and this function is a
state transition. ;) Though yeah, I agree a loop is probably more readable.

> I appreciate the -errno intent, but it actually risks making these
> changes break existing userspace filters: if something is unhandled in
> the emulator in a way we don't find during design and testing, the
> filter load will actually _fail_ instead of just falling back to "run
> filter". Failures should be reported (WARN_ON_ONCE()), but my v1
> intentionally lets this continue.

Right.

> This version appears to have removed all the comments; I liked Jann's
> comments and I had rearranged things a bit to make it more readable
> (IMO) for people that do not immediately understand BPF. :)

Right.
> > +/** > > + * seccomp_cache_prepare - emulate the filter to find cachable syscalls > > + * @sfilter: The seccomp filter > > + * > > + * Returns 0 if successful or -errno if error occurred. > > + */ > > +int seccomp_cache_prepare(struct seccomp_filter *sfilter) > > +{ > > + struct sock_fprog_kern *fprog = sfilter->prog->orig_prog; > > + struct sock_filter *filter = fprog->filter; > > + int arch, nr, res = 0; > > + > > + for (arch = 0; arch < ARRAY_SIZE(syscall_arches); arch++) { > > + for (nr = 0; nr < NR_syscalls; nr++) { > > + struct seccomp_emu_env env = {0}; Btw, do you know what is the initial state of the A register at the start of BPF execution? In my RFC I assumed it's unknown but then in v1 after the "reg_known" removal the register is assumed to be 0. Idk if it is correct to assume so. > I don't really like the complexity here, passing around syscall_ok, etc. > I feel like seccomp_emu_step() should be self-contained to say "allow or > filter" directly. Ok. > I also prefer an inversion to the logic: if we start bitmaps as "default > allow", we only ever increase the filtering cases: we can never > accidentally ADD an allow to the bitmap. (This was an intentional design > in the RFC and v1 to do as much as possible to fail safe.) Wait why? If it's default allow, what if you hit an error? You can accidentally not remove an allow from the bitmap, and that is much more of an issue than accidentally not add an allow. I don't understand your reasoning of "accidentally ADD an allow", an action will only occur when everything is right, but an action might not occur if some random shenanigans happen. Hence, the non-action / default side should be the fail-safe side, rather than the action side. > Why do the prepare here instead of during attach? (And note that it > should not be written to fail.) Right. > And, as per being as defensive as I can imagine, this should be a > one-way mask: we can only remove bits from syscall_ok, never add them. 
> sfilter must be constructed so that it can only ever have fewer or the
> same bits set as prev.

Right.

> In the RFC I did this inherit earlier (in the emulation stage) to
> benefit from the RET_KILL results, but that's not very useful any more.
> However, I think it's still code-locality better to keep the bit
> manipulation logic as close together as possible for readability.

Right.

[1] http://www.catb.org/esr/structure-packing/#_structure_alignment_and_padding

YiFei Zhu
* Re: [PATCH v2 seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent
  2020-09-25  3:04             ` YiFei Zhu
@ 2020-09-25 16:45               ` YiFei Zhu
  2020-09-25 19:42                 ` Kees Cook
  0 siblings, 1 reply; 149+ messages in thread
From: YiFei Zhu @ 2020-09-25 16:45 UTC (permalink / raw)
  To: Kees Cook
  Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai,
	Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn,
	Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

On Thu, Sep 24, 2020 at 10:04 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> > Why do the prepare here instead of during attach? (And note that it
> > should not be written to fail.)
>
> Right.

During attach a spinlock (current->sighand->siglock) is held. Do we
really want to put the emulator in the "atomic section"?

YiFei Zhu
* Re: [PATCH v2 seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent 2020-09-25 16:45 ` YiFei Zhu @ 2020-09-25 19:42 ` Kees Cook 2020-09-25 19:51 ` Andy Lutomirski 0 siblings, 1 reply; 149+ messages in thread From: Kees Cook @ 2020-09-25 19:42 UTC (permalink / raw) To: YiFei Zhu Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Fri, Sep 25, 2020 at 11:45:05AM -0500, YiFei Zhu wrote: > On Thu, Sep 24, 2020 at 10:04 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > > > Why do the prepare here instead of during attach? (And note that it > > > should not be written to fail.) > > > > Right. > > During attach a spinlock (current->sighand->siglock) is held. Do we > really want to put the emulator in the "atomic section"? It's a good point, but I had some other ideas around it that lead to me a different conclusion. Here's what I've got in my head: I don't view filter attach (nor the siglock) as fastpath: the lock is rarely contested and the "long time" will only be during filter attach. When performing filter emulation, all the syscalls that are already marked as "must run filter" on the previous filter can be skipped for the new filter, since it cannot change the outcome, which makes the emulation step faster. The previous filter's bitmap isn't "stable" until siglock is held. If we do the emulation step before siglock, we have to always do full evaluation of all syscalls, and then merge the bitmap during attach. That means all filters ever attached will take maximal time to perform emulation. I prefer the idea of the emulation step taking advantage of the bitmap optimization, since the kernel spends less time doing work over the life of the process tree. 
It's certainly marginal, but it also lets all the bitmap manipulation stay
in one place (as opposed to being split between "prepare" and "attach").

What do you think?

-- 
Kees Cook
* Re: [PATCH v2 seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent 2020-09-25 19:42 ` Kees Cook @ 2020-09-25 19:51 ` Andy Lutomirski 2020-09-25 20:37 ` Kees Cook 0 siblings, 1 reply; 149+ messages in thread From: Andy Lutomirski @ 2020-09-25 19:51 UTC (permalink / raw) To: Kees Cook Cc: YiFei Zhu, Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry > On Sep 25, 2020, at 12:42 PM, Kees Cook <keescook@chromium.org> wrote: > > On Fri, Sep 25, 2020 at 11:45:05AM -0500, YiFei Zhu wrote: >> On Thu, Sep 24, 2020 at 10:04 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: >>>> Why do the prepare here instead of during attach? (And note that it >>>> should not be written to fail.) >>> >>> Right. >> >> During attach a spinlock (current->sighand->siglock) is held. Do we >> really want to put the emulator in the "atomic section"? > > It's a good point, but I had some other ideas around it that lead to me > a different conclusion. Here's what I've got in my head: > > I don't view filter attach (nor the siglock) as fastpath: the lock is > rarely contested and the "long time" will only be during filter attach. > > When performing filter emulation, all the syscalls that are already > marked as "must run filter" on the previous filter can be skipped for > the new filter, since it cannot change the outcome, which makes the > emulation step faster. > > The previous filter's bitmap isn't "stable" until siglock is held. > > If we do the emulation step before siglock, we have to always do full > evaluation of all syscalls, and then merge the bitmap during attach. > That means all filters ever attached will take maximal time to perform > emulation. 
> > I prefer the idea of the emulation step taking advantage of the bitmap > optimization, since the kernel spends less time doing work over the life > of the process tree. It's certainly marginal, but it also lets all the > bitmap manipulation stay in one place (as opposed to being split between > "prepare" and "attach"). > > What do you think? > > I’m wondering if we should be much much lazier. We could potentially wait until someone actually tries to do a given syscall before we try to evaluate whether the result is fixed. ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v2 seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent 2020-09-25 19:51 ` Andy Lutomirski @ 2020-09-25 20:37 ` Kees Cook 2020-09-25 21:07 ` Andy Lutomirski 0 siblings, 1 reply; 149+ messages in thread From: Kees Cook @ 2020-09-25 20:37 UTC (permalink / raw) To: Andy Lutomirski Cc: YiFei Zhu, Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Fri, Sep 25, 2020 at 12:51:20PM -0700, Andy Lutomirski wrote: > > > > On Sep 25, 2020, at 12:42 PM, Kees Cook <keescook@chromium.org> wrote: > > > > On Fri, Sep 25, 2020 at 11:45:05AM -0500, YiFei Zhu wrote: > >> On Thu, Sep 24, 2020 at 10:04 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > >>>> Why do the prepare here instead of during attach? (And note that it > >>>> should not be written to fail.) > >>> > >>> Right. > >> > >> During attach a spinlock (current->sighand->siglock) is held. Do we > >> really want to put the emulator in the "atomic section"? > > > > It's a good point, but I had some other ideas around it that lead to me > > a different conclusion. Here's what I've got in my head: > > > > I don't view filter attach (nor the siglock) as fastpath: the lock is > > rarely contested and the "long time" will only be during filter attach. > > > > When performing filter emulation, all the syscalls that are already > > marked as "must run filter" on the previous filter can be skipped for > > the new filter, since it cannot change the outcome, which makes the > > emulation step faster. > > > > The previous filter's bitmap isn't "stable" until siglock is held. > > > > If we do the emulation step before siglock, we have to always do full > > evaluation of all syscalls, and then merge the bitmap during attach. 
> > That means all filters ever attached will take maximal time to perform
> > emulation.
> >
> > I prefer the idea of the emulation step taking advantage of the bitmap
> > optimization, since the kernel spends less time doing work over the life
> > of the process tree. It's certainly marginal, but it also lets all the
> > bitmap manipulation stay in one place (as opposed to being split between
> > "prepare" and "attach").
> >
> > What do you think?
>
> I’m wondering if we should be much much lazier. We could potentially wait
> until someone actually tries to do a given syscall before we try to
> evaluate whether the result is fixed.

That seems like we'd need to track yet another bitmap of "did we emulate
this yet?" And it means the filter isn't really "done" until you run
another syscall? eeh, I'm not a fan: it scratches at my desire for
determinism. ;) Or maybe my implementation imagination is missing
something?

-- 
Kees Cook
* Re: [PATCH v2 seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent 2020-09-25 20:37 ` Kees Cook @ 2020-09-25 21:07 ` Andy Lutomirski 2020-09-25 23:49 ` Kees Cook 2020-09-26 1:23 ` YiFei Zhu 0 siblings, 2 replies; 149+ messages in thread From: Andy Lutomirski @ 2020-09-25 21:07 UTC (permalink / raw) To: Kees Cook Cc: YiFei Zhu, Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Fri, Sep 25, 2020 at 1:37 PM Kees Cook <keescook@chromium.org> wrote: > > On Fri, Sep 25, 2020 at 12:51:20PM -0700, Andy Lutomirski wrote: > > > > > > > On Sep 25, 2020, at 12:42 PM, Kees Cook <keescook@chromium.org> wrote: > > > > > > On Fri, Sep 25, 2020 at 11:45:05AM -0500, YiFei Zhu wrote: > > >> On Thu, Sep 24, 2020 at 10:04 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > > >>>> Why do the prepare here instead of during attach? (And note that it > > >>>> should not be written to fail.) > > >>> > > >>> Right. > > >> > > >> During attach a spinlock (current->sighand->siglock) is held. Do we > > >> really want to put the emulator in the "atomic section"? > > > > > > It's a good point, but I had some other ideas around it that lead to me > > > a different conclusion. Here's what I've got in my head: > > > > > > I don't view filter attach (nor the siglock) as fastpath: the lock is > > > rarely contested and the "long time" will only be during filter attach. > > > > > > When performing filter emulation, all the syscalls that are already > > > marked as "must run filter" on the previous filter can be skipped for > > > the new filter, since it cannot change the outcome, which makes the > > > emulation step faster. > > > > > > The previous filter's bitmap isn't "stable" until siglock is held. 
> > > If we do the emulation step before siglock, we have to always do full
> > > evaluation of all syscalls, and then merge the bitmap during attach.
> > > That means all filters ever attached will take maximal time to perform
> > > emulation.
> > >
> > > I prefer the idea of the emulation step taking advantage of the bitmap
> > > optimization, since the kernel spends less time doing work over the life
> > > of the process tree. It's certainly marginal, but it also lets all the
> > > bitmap manipulation stay in one place (as opposed to being split between
> > > "prepare" and "attach").
> > >
> > > What do you think?
> >
> > I’m wondering if we should be much much lazier. We could potentially wait
> > until someone actually tries to do a given syscall before we try to
> > evaluate whether the result is fixed.
>
> That seems like we'd need to track yet another bitmap of "did we emulate
> this yet?" And it means the filter isn't really "done" until you run
> another syscall? eeh, I'm not a fan: it scratches at my desire for
> determinism. ;) Or maybe my implementation imagination is missing
> something?

We'd need at least three states per syscall: unknown, always-allow, and
need-to-run-filter.

The downsides are less determinism and a bit of an uglier implementation.
The upside is that we don't need to loop over all syscalls at load --
instead the time that each operation takes is independent of the total
number of syscalls on the system. And we can entirely avoid, say,
evaluating the x32 case until the task tries an x32 syscall.

I think it's at least worth considering.

--Andy
* Re: [PATCH v2 seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent 2020-09-25 21:07 ` Andy Lutomirski @ 2020-09-25 23:49 ` Kees Cook 2020-09-26 0:34 ` Andy Lutomirski 2020-09-26 1:23 ` YiFei Zhu 1 sibling, 1 reply; 149+ messages in thread From: Kees Cook @ 2020-09-25 23:49 UTC (permalink / raw) To: Andy Lutomirski Cc: YiFei Zhu, Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Fri, Sep 25, 2020 at 02:07:46PM -0700, Andy Lutomirski wrote: > On Fri, Sep 25, 2020 at 1:37 PM Kees Cook <keescook@chromium.org> wrote: > > > > On Fri, Sep 25, 2020 at 12:51:20PM -0700, Andy Lutomirski wrote: > > > > > > > > > > On Sep 25, 2020, at 12:42 PM, Kees Cook <keescook@chromium.org> wrote: > > > > > > > > On Fri, Sep 25, 2020 at 11:45:05AM -0500, YiFei Zhu wrote: > > > >> On Thu, Sep 24, 2020 at 10:04 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > > > >>>> Why do the prepare here instead of during attach? (And note that it > > > >>>> should not be written to fail.) > > > >>> > > > >>> Right. > > > >> > > > >> During attach a spinlock (current->sighand->siglock) is held. Do we > > > >> really want to put the emulator in the "atomic section"? > > > > > > > > It's a good point, but I had some other ideas around it that lead to me > > > > a different conclusion. Here's what I've got in my head: > > > > > > > > I don't view filter attach (nor the siglock) as fastpath: the lock is > > > > rarely contested and the "long time" will only be during filter attach. > > > > > > > > When performing filter emulation, all the syscalls that are already > > > > marked as "must run filter" on the previous filter can be skipped for > > > > the new filter, since it cannot change the outcome, which makes the > > > > emulation step faster. 
> > > > > > > > The previous filter's bitmap isn't "stable" until siglock is held. > > > > > > > > If we do the emulation step before siglock, we have to always do full > > > > evaluation of all syscalls, and then merge the bitmap during attach. > > > > That means all filters ever attached will take maximal time to perform > > > > emulation. > > > > > > > > I prefer the idea of the emulation step taking advantage of the bitmap > > > > optimization, since the kernel spends less time doing work over the life > > > > of the process tree. It's certainly marginal, but it also lets all the > > > > bitmap manipulation stay in one place (as opposed to being split between > > > > "prepare" and "attach"). > > > > > > > > What do you think? > > > > > > > > > > > > > > I’m wondering if we should be much much lazier. We could potentially wait until someone actually tries to do a given syscall before we try to evaluate whether the result is fixed. > > > > That seems like we'd need to track yet another bitmap of "did we emulate > > this yet?" And it means the filter isn't really "done" until you run > > another syscall? eeh, I'm not a fan: it scratches at my desire for > > determinism. ;) Or maybe my implementation imagination is missing > > something? > > > > We'd need at least three states per syscall: unknown, always-allow, > and need-to-run-filter. > > The downsides are less determinism and a bit of an uglier > implementation. The upside is that we don't need to loop over all > syscalls at load -- instead the time that each operation takes is > independent of the total number of syscalls on the system. And we can > entirely avoid, say, evaluating the x32 case until the task tries an > x32 syscall. > > I think it's at least worth considering. Yeah, worth considering. I do still think the time spent in emulation is SO small that it doesn't matter running all of the syscalls at attach time. The filters are tiny and fail quickly if anything "interesting" starts to happen. 
;) -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v2 seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent 2020-09-25 23:49 ` Kees Cook @ 2020-09-26 0:34 ` Andy Lutomirski 0 siblings, 0 replies; 149+ messages in thread From: Andy Lutomirski @ 2020-09-26 0:34 UTC (permalink / raw) To: Kees Cook Cc: YiFei Zhu, Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry > On Sep 25, 2020, at 4:49 PM, Kees Cook <keescook@chromium.org> wrote: > > On Fri, Sep 25, 2020 at 02:07:46PM -0700, Andy Lutomirski wrote: >>> On Fri, Sep 25, 2020 at 1:37 PM Kees Cook <keescook@chromium.org> wrote: >>> >>> On Fri, Sep 25, 2020 at 12:51:20PM -0700, Andy Lutomirski wrote: >>>> >>>> >>>>> On Sep 25, 2020, at 12:42 PM, Kees Cook <keescook@chromium.org> wrote: >>>>> >>>>> On Fri, Sep 25, 2020 at 11:45:05AM -0500, YiFei Zhu wrote: >>>>>> On Thu, Sep 24, 2020 at 10:04 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: >>>>>>>> Why do the prepare here instead of during attach? (And note that it >>>>>>>> should not be written to fail.) >>>>>>> >>>>>>> Right. >>>>>> >>>>>> During attach a spinlock (current->sighand->siglock) is held. Do we >>>>>> really want to put the emulator in the "atomic section"? >>>>> >>>>> It's a good point, but I had some other ideas around it that lead to me >>>>> a different conclusion. Here's what I've got in my head: >>>>> >>>>> I don't view filter attach (nor the siglock) as fastpath: the lock is >>>>> rarely contested and the "long time" will only be during filter attach. >>>>> >>>>> When performing filter emulation, all the syscalls that are already >>>>> marked as "must run filter" on the previous filter can be skipped for >>>>> the new filter, since it cannot change the outcome, which makes the >>>>> emulation step faster. 
>>>>> >>>>> The previous filter's bitmap isn't "stable" until siglock is held. >>>>> >>>>> If we do the emulation step before siglock, we have to always do full >>>>> evaluation of all syscalls, and then merge the bitmap during attach. >>>>> That means all filters ever attached will take maximal time to perform >>>>> emulation. >>>>> >>>>> I prefer the idea of the emulation step taking advantage of the bitmap >>>>> optimization, since the kernel spends less time doing work over the life >>>>> of the process tree. It's certainly marginal, but it also lets all the >>>>> bitmap manipulation stay in one place (as opposed to being split between >>>>> "prepare" and "attach"). >>>>> >>>>> What do you think? >>>>> >>>>> >>>> >>>> I’m wondering if we should be much much lazier. We could potentially wait until someone actually tries to do a given syscall before we try to evaluate whether the result is fixed. >>> >>> That seems like we'd need to track yet another bitmap of "did we emulate >>> this yet?" And it means the filter isn't really "done" until you run >>> another syscall? eeh, I'm not a fan: it scratches at my desire for >>> determinism. ;) Or maybe my implementation imagination is missing >>> something? >>> >> >> We'd need at least three states per syscall: unknown, always-allow, >> and need-to-run-filter. >> >> The downsides are less determinism and a bit of an uglier >> implementation. The upside is that we don't need to loop over all >> syscalls at load -- instead the time that each operation takes is >> independent of the total number of syscalls on the system. And we can >> entirely avoid, say, evaluating the x32 case until the task tries an >> x32 syscall. >> >> I think it's at least worth considering. > > Yeah, worth considering. I do still think the time spent in emulation is > SO small that it doesn't matter running all of the syscalls at attach > time. The filters are tiny and fail quickly if anything "interesting" > start to happen. 
;) > There’s a middle ground, too: do it lazily per arch. So we would allocate and populate the compat bitmap the first time a compat syscall is attempted and do the same for x32. This may help avoid the annoying extra memory usage and 3x startup overhead while retaining full functionality. ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v2 seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent 2020-09-25 21:07 ` Andy Lutomirski 2020-09-25 23:49 ` Kees Cook @ 2020-09-26 1:23 ` YiFei Zhu 2020-09-26 2:47 ` Andy Lutomirski 1 sibling, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-09-26 1:23 UTC (permalink / raw) To: Andy Lutomirski Cc: Kees Cook, Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Fri, Sep 25, 2020 at 4:07 PM Andy Lutomirski <luto@amacapital.net> wrote: > We'd need at least three states per syscall: unknown, always-allow, > and need-to-run-filter. > > The downsides are less determinism and a bit of an uglier > implementation. The upside is that we don't need to loop over all > syscalls at load -- instead the time that each operation takes is > independent of the total number of syscalls on the system. And we can > entirely avoid, say, evaluating the x32 case until the task tries an > x32 syscall. I was really afraid of multiple tasks writing to the bitmaps at once, hence I used bitmap-per-task. Now that I think about it, if this stays lockless, the worst thing that can happen is that a write undoes a bit set by another task. In this case, if the "known" bit is cleared then the worst would be that the emulation is run many times. But if the "always allow" bit is cleared but not the "known" bit then we have an issue: the syscall will always be executed in BPF. Is it worth holding a spinlock here? Though I'll try to get the benchmark numbers for the emulator later tonight. YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v2 seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent 2020-09-26 1:23 ` YiFei Zhu @ 2020-09-26 2:47 ` Andy Lutomirski 2020-09-26 4:35 ` Kees Cook 0 siblings, 1 reply; 149+ messages in thread From: Andy Lutomirski @ 2020-09-26 2:47 UTC (permalink / raw) To: YiFei Zhu Cc: Kees Cook, Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry > On Sep 25, 2020, at 6:23 PM, YiFei Zhu <zhuyifei1999@gmail.com> wrote: > > On Fri, Sep 25, 2020 at 4:07 PM Andy Lutomirski <luto@amacapital.net> wrote: >> We'd need at least three states per syscall: unknown, always-allow, >> and need-to-run-filter. >> >> The downsides are less determinism and a bit of an uglier >> implementation. The upside is that we don't need to loop over all >> syscalls at load -- instead the time that each operation takes is >> independent of the total number of syscalls on the system. And we can >> entirely avoid, say, evaluating the x32 case until the task tries an >> x32 syscall. > > I was really afraid of multiple tasks writing to the bitmaps at once, > hence I used bitmap-per-task. Now I think about it, if this stays > lockless, the worst thing that can happen is that a write undo a bit > set by another task. In this case, if the "known" bit is cleared then > the worst would be the emulation is run many times. But if the "always > allow" is cleared but not "known" bit then we have an issue: the > syscall will always be executed in BPF. > If you interleave the bits, then you can read and write them atomically — both bits for any given syscall will be in the same word. > Is it worth holding a spinlock here? > > Though I'll try to get the benchmark numbers for the emulator later tonight. > > YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v2 seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent 2020-09-26 2:47 ` Andy Lutomirski @ 2020-09-26 4:35 ` Kees Cook 0 siblings, 0 replies; 149+ messages in thread From: Kees Cook @ 2020-09-26 4:35 UTC (permalink / raw) To: Andy Lutomirski Cc: YiFei Zhu, Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Fri, Sep 25, 2020 at 07:47:47PM -0700, Andy Lutomirski wrote: > > > On Sep 25, 2020, at 6:23 PM, YiFei Zhu <zhuyifei1999@gmail.com> wrote: > > > > On Fri, Sep 25, 2020 at 4:07 PM Andy Lutomirski <luto@amacapital.net> wrote: > >> We'd need at least three states per syscall: unknown, always-allow, > >> and need-to-run-filter. > >> > >> The downsides are less determinism and a bit of an uglier > >> implementation. The upside is that we don't need to loop over all > >> syscalls at load -- instead the time that each operation takes is > >> independent of the total number of syscalls on the system. And we can > >> entirely avoid, say, evaluating the x32 case until the task tries an > >> x32 syscall. > > > > I was really afraid of multiple tasks writing to the bitmaps at once, > > hence I used bitmap-per-task. Now I think about it, if this stays > > lockless, the worst thing that can happen is that a write undo a bit > > set by another task. In this case, if the "known" bit is cleared then > > the worst would be the emulation is run many times. But if the "always > > allow" is cleared but not "known" bit then we have an issue: the > > syscall will always be executed in BPF. > > > > If you interleave the bits, then you can read and write them atomically — both bits for any given syscall will be in the same word. I think we can just hold the spinlock. :) -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
* [PATCH v2 seccomp 4/6] seccomp/cache: Lookup syscall allowlist for fast path 2020-09-24 12:44 ` [PATCH v2 seccomp 0/6] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls YiFei Zhu ` (2 preceding siblings ...) 2020-09-24 12:44 ` [PATCH v2 seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent YiFei Zhu @ 2020-09-24 12:44 ` YiFei Zhu 2020-09-24 23:46 ` Kees Cook 2020-09-24 12:44 ` [PATCH v2 seccomp 5/6] selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu 2020-09-24 12:44 ` [PATCH v2 seccomp 6/6] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu 5 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-09-24 12:44 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: YiFei Zhu <yifeifz2@illinois.edu> The fast (common) path for seccomp should be that the filter permits the syscall to pass through, and failing seccomp is expected to be an exceptional case; it is not expected for userspace to call a denylisted syscall over and over. This first finds the current allow bitmask by iterating through syscall_arches[] array and comparing it to the one in struct seccomp_data; this loop is expected to be unrolled. It then does a test_bit against the bitmask. If the bit is set, then there is no need to run the full filter; it returns SECCOMP_RET_ALLOW immediately. 
Co-developed-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu> Signed-off-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu> Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> --- kernel/seccomp.c | 37 +++++++++++++++++++++++++++++++++++++ 1 file changed, 37 insertions(+) diff --git a/kernel/seccomp.c b/kernel/seccomp.c index 20d33378a092..ac0266b6d18a 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -167,6 +167,12 @@ static inline void seccomp_cache_inherit(struct seccomp_filter *sfilter, const struct seccomp_filter *prev) { } + +static inline bool seccomp_cache_check(const struct seccomp_filter *sfilter, + const struct seccomp_data *sd) +{ + return false; +} #endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */ /** @@ -321,6 +327,34 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen) return 0; } +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY +/** + * seccomp_cache_check - lookup seccomp cache + * @sfilter: The seccomp filter + * @sd: The seccomp data to lookup the cache with + * + * Returns true if the seccomp_data is cached and allowed. 
+ */ +static bool seccomp_cache_check(const struct seccomp_filter *sfilter, + const struct seccomp_data *sd) +{ + int syscall_nr = sd->nr; + int arch; + + if (unlikely(syscall_nr < 0 || syscall_nr >= NR_syscalls)) + return false; + + for (arch = 0; arch < ARRAY_SIZE(syscall_arches); arch++) { + if (likely(syscall_arches[arch] == sd->arch)) + return test_bit(syscall_nr, + sfilter->cache.syscall_ok[arch]); + } + + WARN_ON_ONCE(true); + return false; +} +#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */ + /** * seccomp_run_filters - evaluates all seccomp filters against @sd * @sd: optional seccomp data to be passed to filters @@ -343,6 +377,9 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd, if (WARN_ON(f == NULL)) return SECCOMP_RET_KILL_PROCESS; + if (seccomp_cache_check(f, sd)) + return SECCOMP_RET_ALLOW; + /* * All filters in the list are evaluated and the lowest BPF return * value always takes priority (ignoring the DATA). -- 2.28.0 ^ permalink raw reply related [flat|nested] 149+ messages in thread
* Re: [PATCH v2 seccomp 4/6] seccomp/cache: Lookup syscall allowlist for fast path 2020-09-24 12:44 ` [PATCH v2 seccomp 4/6] seccomp/cache: Lookup syscall allowlist for fast path YiFei Zhu @ 2020-09-24 23:46 ` Kees Cook 2020-09-25 1:55 ` YiFei Zhu 0 siblings, 1 reply; 149+ messages in thread From: Kees Cook @ 2020-09-24 23:46 UTC (permalink / raw) To: YiFei Zhu Cc: containers, YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Thu, Sep 24, 2020 at 07:44:19AM -0500, YiFei Zhu wrote: > From: YiFei Zhu <yifeifz2@illinois.edu> > > The fast (common) path for seccomp should be that the filter permits > the syscall to pass through, and failing seccomp is expected to be > an exceptional case; it is not expected for userspace to call a > denylisted syscall over and over. > > This first finds the current allow bitmask by iterating through > syscall_arches[] array and comparing it to the one in struct > seccomp_data; this loop is expected to be unrolled. It then > does a test_bit against the bitmask. If the bit is set, then > there is no need to run the full filter; it returns > SECCOMP_RET_ALLOW immediately. 
> > Co-developed-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu> > Signed-off-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu> > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> > --- > kernel/seccomp.c | 37 +++++++++++++++++++++++++++++++++++++ > 1 file changed, 37 insertions(+) > > diff --git a/kernel/seccomp.c b/kernel/seccomp.c > index 20d33378a092..ac0266b6d18a 100644 > --- a/kernel/seccomp.c > +++ b/kernel/seccomp.c > @@ -167,6 +167,12 @@ static inline void seccomp_cache_inherit(struct seccomp_filter *sfilter, > const struct seccomp_filter *prev) > { > } > + > +static inline bool seccomp_cache_check(const struct seccomp_filter *sfilter, > + const struct seccomp_data *sd) > +{ > + return false; > +} > #endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */ > > /** > @@ -321,6 +327,34 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen) > return 0; > } > > +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY > +/** > + * seccomp_cache_check - lookup seccomp cache > + * @sfilter: The seccomp filter > + * @sd: The seccomp data to lookup the cache with > + * > + * Returns true if the seccomp_data is cached and allowed. > + */ > +static bool seccomp_cache_check(const struct seccomp_filter *sfilter, > + const struct seccomp_data *sd) > +{ > + int syscall_nr = sd->nr; > + int arch; > + > + if (unlikely(syscall_nr < 0 || syscall_nr >= NR_syscalls)) > + return false; This protects us from x32 (i.e. syscall_nr will have 0x40000000 bit set), but given the effort needed to support compat, I think supporting x32 isn't much more. (Though again, I note that NR_syscalls differs in size, so this test needs to be per-arch and obviously after arch-discovery.) That said, if it really does turn out that x32 is literally the only architecture doing these shenanigans (and I suspect not, given the MIPS case), okay, fine, I'll give in. :) You and Jann both seem to think this isn't worth it. 
> + > + for (arch = 0; arch < ARRAY_SIZE(syscall_arches); arch++) { > + if (likely(syscall_arches[arch] == sd->arch)) I think this linear search for the matching arch can be made O(1) (this is what I was trying to do in v1: we can map all possible combos to a distinct bitmap, so there is just math and lookup rather than a linear compare search. In the one-arch case, it can also be easily collapsed into a no-op (though my v1 didn't do this correctly). > + return test_bit(syscall_nr, > + sfilter->cache.syscall_ok[arch]); > + } > + > + WARN_ON_ONCE(true); > + return false; > +} > +#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */ > + > /** > * seccomp_run_filters - evaluates all seccomp filters against @sd > * @sd: optional seccomp data to be passed to filters > @@ -343,6 +377,9 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd, > if (WARN_ON(f == NULL)) > return SECCOMP_RET_KILL_PROCESS; > > + if (seccomp_cache_check(f, sd)) > + return SECCOMP_RET_ALLOW; > + > /* > * All filters in the list are evaluated and the lowest BPF return > * value always takes priority (ignoring the DATA). > -- > 2.28.0 > -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v2 seccomp 4/6] seccomp/cache: Lookup syscall allowlist for fast path 2020-09-24 23:46 ` Kees Cook @ 2020-09-25 1:55 ` YiFei Zhu 0 siblings, 0 replies; 149+ messages in thread From: YiFei Zhu @ 2020-09-25 1:55 UTC (permalink / raw) To: Kees Cook Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Thu, Sep 24, 2020 at 6:46 PM Kees Cook <keescook@chromium.org> wrote: > This protects us from x32 (i.e. syscall_nr will have 0x40000000 bit > set), but given the effort needed to support compat, I think supporting > x32 isn't much more. (Though again, I note that NR_syscalls differs in > size, so this test needs to be per-arch and obviously after > arch-discovery.) > > That said, if it really does turn out that x32 is literally the only > architecture doing these shenanigans (and I suspect not, given the MIPS > case), okay, fine, I'll give in. :) You and Jann both seem to think this > isn't worth it. MIPS has the sparse syscall shenanigans... idek how that works. Maybe someone can clarify? > I think this linear search for the matching arch can be made O(1) (this > is what I was trying to do in v1: we can map all possible combos to a > distinct bitmap, so there is just math and lookup rather than a linear > compare search. In the one-arch case, it can also be easily collapsed > into a no-op (though my v1 didn't do this correctly). I remember yours was: static inline u8 seccomp_get_arch(u32 syscall_arch, u32 syscall_nr) { [...] switch (syscall_arch) { case SECCOMP_ARCH: seccomp_arch = SECCOMP_ARCH_IS_NATIVE; break; #ifdef CONFIG_COMPAT case SECCOMP_ARCH_COMPAT: seccomp_arch = SECCOMP_ARCH_IS_COMPAT; break; #endif default: seccomp_arch = SECCOMP_ARCH_IS_UNKNOWN; } What I'm relying on here is that the compiler will unroll the loop. 
How does the compiler perform switch statements? I was imagining it would be similar, with "case" corresponding to a compare on the immediate, and the assign as a move to a register, and break corresponding to a jump. This would also be O(n) in the number of arches. Yes, compilers can also do an O(1) table lookup, but that is nonsensical here -- the arch numbers occupy the MSBs. That said, does O(1) or O(n) matter here? Given that n is at most 3 you might as well consider it a constant. Also, is "collapse in the one-arch case" actually worth it? Given that there's a likely(), and the other side is a WARN_ON_ONCE(), the compiler will lay out the likely path in the fast path and branch prediction will be in our favor, right? YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
* [PATCH v2 seccomp 5/6] selftests/seccomp: Compare bitmap vs filter overhead 2020-09-24 12:44 ` [PATCH v2 seccomp 0/6] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls YiFei Zhu ` (3 preceding siblings ...) 2020-09-24 12:44 ` [PATCH v2 seccomp 4/6] seccomp/cache: Lookup syscall allowlist for fast path YiFei Zhu @ 2020-09-24 12:44 ` YiFei Zhu 2020-09-24 23:47 ` Kees Cook 2020-09-24 12:44 ` [PATCH v2 seccomp 6/6] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu 5 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-09-24 12:44 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: Kees Cook <keescook@chromium.org> As part of the seccomp benchmarking, include the expectations with regard to the timing behavior of the constant action bitmaps, and report inconsistencies better. Example output with constant action bitmaps on x86: $ sudo ./seccomp_benchmark 100000000 Current BPF sysctl settings: net.core.bpf_jit_enable = 1 net.core.bpf_jit_harden = 0 Benchmarking 100000000 syscalls... 
63.896255358 - 0.008504529 = 63887750829 (63.9s) getpid native: 638 ns 130.383312423 - 63.897315189 = 66485997234 (66.5s) getpid RET_ALLOW 1 filter (bitmap): 664 ns 196.789080421 - 130.384414983 = 66404665438 (66.4s) getpid RET_ALLOW 2 filters (bitmap): 664 ns 268.844643304 - 196.790234168 = 72054409136 (72.1s) getpid RET_ALLOW 3 filters (full): 720 ns 342.627472515 - 268.845799103 = 73781673412 (73.8s) getpid RET_ALLOW 4 filters (full): 737 ns Estimated total seccomp overhead for 1 bitmapped filter: 26 ns Estimated total seccomp overhead for 2 bitmapped filters: 26 ns Estimated total seccomp overhead for 3 full filters: 82 ns Estimated total seccomp overhead for 4 full filters: 99 ns Estimated seccomp entry overhead: 26 ns Estimated seccomp per-filter overhead (last 2 diff): 17 ns Estimated seccomp per-filter overhead (filters / 4): 18 ns Expectations: native ≤ 1 bitmap (638 ≤ 664): ✔️ native ≤ 1 filter (638 ≤ 720): ✔️ per-filter (last 2 diff) ≈ per-filter (filters / 4) (17 ≈ 18): ✔️ 1 bitmapped ≈ 2 bitmapped (26 ≈ 26): ✔️ entry ≈ 1 bitmapped (26 ≈ 26): ✔️ entry ≈ 2 bitmapped (26 ≈ 26): ✔️ native + entry + (per filter * 4) ≈ 4 filters total (732 ≈ 737): ✔️ Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> --- .../selftests/seccomp/seccomp_benchmark.c | 151 +++++++++++++++--- tools/testing/selftests/seccomp/settings | 2 +- 2 files changed, 130 insertions(+), 23 deletions(-) diff --git a/tools/testing/selftests/seccomp/seccomp_benchmark.c b/tools/testing/selftests/seccomp/seccomp_benchmark.c index 91f5a89cadac..fcc806585266 100644 --- a/tools/testing/selftests/seccomp/seccomp_benchmark.c +++ b/tools/testing/selftests/seccomp/seccomp_benchmark.c @@ -4,12 +4,16 @@ */ #define _GNU_SOURCE #include <assert.h> +#include <limits.h> +#include <stdbool.h> +#include <stddef.h> #include <stdio.h> #include <stdlib.h> #include <time.h> #include <unistd.h> #include <linux/filter.h> #include <linux/seccomp.h> +#include <sys/param.h> 
#include <sys/prctl.h> #include <sys/syscall.h> #include <sys/types.h> @@ -70,18 +74,74 @@ unsigned long long calibrate(void) return samples * seconds; } +bool approx(int i_one, int i_two) +{ + double one = i_one, one_bump = one * 0.01; + double two = i_two, two_bump = two * 0.01; + + one_bump = one + MAX(one_bump, 2.0); + two_bump = two + MAX(two_bump, 2.0); + + /* Equal to, or within 1% or 2 digits */ + if (one == two || + (one > two && one <= two_bump) || + (two > one && two <= one_bump)) + return true; + return false; +} + +bool le(int i_one, int i_two) +{ + if (i_one <= i_two) + return true; + return false; +} + +long compare(const char *name_one, const char *name_eval, const char *name_two, + unsigned long long one, bool (*eval)(int, int), unsigned long long two) +{ + bool good; + + printf("\t%s %s %s (%lld %s %lld): ", name_one, name_eval, name_two, + (long long)one, name_eval, (long long)two); + if (one > INT_MAX) { + printf("Miscalculation! Measurement went negative: %lld\n", (long long)one); + return 1; + } + if (two > INT_MAX) { + printf("Miscalculation! Measurement went negative: %lld\n", (long long)two); + return 1; + } + + good = eval(one, two); + printf("%s\n", good ? "✔️" : "❌"); + + return good ? 
0 : 1; +} + int main(int argc, char *argv[]) { + struct sock_filter bitmap_filter[] = { + BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, nr)), + BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW), + }; + struct sock_fprog bitmap_prog = { + .len = (unsigned short)ARRAY_SIZE(bitmap_filter), + .filter = bitmap_filter, + }; struct sock_filter filter[] = { + BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, args[0])), BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW), }; struct sock_fprog prog = { .len = (unsigned short)ARRAY_SIZE(filter), .filter = filter, }; - long ret; - unsigned long long samples; - unsigned long long native, filter1, filter2; + + long ret, bits; + unsigned long long samples, calc; + unsigned long long native, filter1, filter2, bitmap1, bitmap2; + unsigned long long entry, per_filter1, per_filter2; printf("Current BPF sysctl settings:\n"); system("sysctl net.core.bpf_jit_enable"); @@ -101,35 +161,82 @@ int main(int argc, char *argv[]) ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0); assert(ret == 0); - /* One filter */ - ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog); + /* One filter resulting in a bitmap */ + ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bitmap_prog); assert(ret == 0); - filter1 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples; - printf("getpid RET_ALLOW 1 filter: %llu ns\n", filter1); + bitmap1 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples; + printf("getpid RET_ALLOW 1 filter (bitmap): %llu ns\n", bitmap1); + + /* Second filter resulting in a bitmap */ + ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bitmap_prog); + assert(ret == 0); - if (filter1 == native) - printf("No overhead measured!? 
Try running again with more samples.\n"); + bitmap2 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples; + printf("getpid RET_ALLOW 2 filters (bitmap): %llu ns\n", bitmap2); - /* Two filters */ + /* Third filter, can no longer be converted to bitmap */ ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog); assert(ret == 0); - filter2 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples; - printf("getpid RET_ALLOW 2 filters: %llu ns\n", filter2); - - /* Calculations */ - printf("Estimated total seccomp overhead for 1 filter: %llu ns\n", - filter1 - native); + filter1 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples; + printf("getpid RET_ALLOW 3 filters (full): %llu ns\n", filter1); - printf("Estimated total seccomp overhead for 2 filters: %llu ns\n", - filter2 - native); + /* Fourth filter, can not be converted to bitmap because of filter 3 */ + ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bitmap_prog); + assert(ret == 0); - printf("Estimated seccomp per-filter overhead: %llu ns\n", - filter2 - filter1); + filter2 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples; + printf("getpid RET_ALLOW 4 filters (full): %llu ns\n", filter2); + + /* Estimations */ +#define ESTIMATE(fmt, var, what) do { \ + var = (what); \ + printf("Estimated " fmt ": %llu ns\n", var); \ + if (var > INT_MAX) \ + goto more_samples; \ + } while (0) + + ESTIMATE("total seccomp overhead for 1 bitmapped filter", calc, + bitmap1 - native); + ESTIMATE("total seccomp overhead for 2 bitmapped filters", calc, + bitmap2 - native); + ESTIMATE("total seccomp overhead for 3 full filters", calc, + filter1 - native); + ESTIMATE("total seccomp overhead for 4 full filters", calc, + filter2 - native); + ESTIMATE("seccomp entry overhead", entry, + bitmap1 - native - (bitmap2 - bitmap1)); + ESTIMATE("seccomp per-filter overhead (last 2 diff)", per_filter1, + filter2 - filter1); + ESTIMATE("seccomp per-filter overhead (filters / 4)", per_filter2, + (filter2 - native - entry) / 4); + + 
printf("Expectations:\n"); + ret |= compare("native", "≤", "1 bitmap", native, le, bitmap1); + bits = compare("native", "≤", "1 filter", native, le, filter1); + if (bits) + goto more_samples; + + ret |= compare("per-filter (last 2 diff)", "≈", "per-filter (filters / 4)", + per_filter1, approx, per_filter2); + + bits = compare("1 bitmapped", "≈", "2 bitmapped", + bitmap1 - native, approx, bitmap2 - native); + if (bits) { + printf("Skipping constant action bitmap expectations: they appear unsupported.\n"); + goto out; + } - printf("Estimated seccomp entry overhead: %llu ns\n", - filter1 - native - (filter2 - filter1)); + ret |= compare("entry", "≈", "1 bitmapped", entry, approx, bitmap1 - native); + ret |= compare("entry", "≈", "2 bitmapped", entry, approx, bitmap2 - native); + ret |= compare("native + entry + (per filter * 4)", "≈", "4 filters total", + entry + (per_filter1 * 4) + native, approx, filter2); + if (ret == 0) + goto out; +more_samples: + printf("Saw unexpected benchmark result. Try running again with more samples?\n"); +out: return 0; } diff --git a/tools/testing/selftests/seccomp/settings b/tools/testing/selftests/seccomp/settings index ba4d85f74cd6..6091b45d226b 100644 --- a/tools/testing/selftests/seccomp/settings +++ b/tools/testing/selftests/seccomp/settings @@ -1 +1 @@ -timeout=90 +timeout=120 -- 2.28.0 ^ permalink raw reply related [flat|nested] 149+ messages in thread
* Re: [PATCH v2 seccomp 5/6] selftests/seccomp: Compare bitmap vs filter overhead 2020-09-24 12:44 ` [PATCH v2 seccomp 5/6] selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu @ 2020-09-24 23:47 ` Kees Cook 2020-09-25 1:35 ` YiFei Zhu 0 siblings, 1 reply; 149+ messages in thread From: Kees Cook @ 2020-09-24 23:47 UTC (permalink / raw) To: YiFei Zhu Cc: containers, YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Thu, Sep 24, 2020 at 07:44:20AM -0500, YiFei Zhu wrote: > From: Kees Cook <keescook@chromium.org> > > As part of the seccomp benchmarking, include the expectations with > regard to the timing behavior of the constant action bitmaps, and report > inconsistencies better. > > Example output with constant action bitmaps on x86: > > $ sudo ./seccomp_benchmark 100000000 > Current BPF sysctl settings: > net.core.bpf_jit_enable = 1 > net.core.bpf_jit_harden = 0 > Benchmarking 100000000 syscalls... 
> 63.896255358 - 0.008504529 = 63887750829 (63.9s) > getpid native: 638 ns > 130.383312423 - 63.897315189 = 66485997234 (66.5s) > getpid RET_ALLOW 1 filter (bitmap): 664 ns > 196.789080421 - 130.384414983 = 66404665438 (66.4s) > getpid RET_ALLOW 2 filters (bitmap): 664 ns > 268.844643304 - 196.790234168 = 72054409136 (72.1s) > getpid RET_ALLOW 3 filters (full): 720 ns > 342.627472515 - 268.845799103 = 73781673412 (73.8s) > getpid RET_ALLOW 4 filters (full): 737 ns > Estimated total seccomp overhead for 1 bitmapped filter: 26 ns > Estimated total seccomp overhead for 2 bitmapped filters: 26 ns > Estimated total seccomp overhead for 3 full filters: 82 ns > Estimated total seccomp overhead for 4 full filters: 99 ns > Estimated seccomp entry overhead: 26 ns > Estimated seccomp per-filter overhead (last 2 diff): 17 ns > Estimated seccomp per-filter overhead (filters / 4): 18 ns > Expectations: > native ≤ 1 bitmap (638 ≤ 664): ✔️ > native ≤ 1 filter (638 ≤ 720): ✔️ > per-filter (last 2 diff) ≈ per-filter (filters / 4) (17 ≈ 18): ✔️ > 1 bitmapped ≈ 2 bitmapped (26 ≈ 26): ✔️ > entry ≈ 1 bitmapped (26 ≈ 26): ✔️ > entry ≈ 2 bitmapped (26 ≈ 26): ✔️ > native + entry + (per filter * 4) ≈ 4 filters total (732 ≈ 737): ✔️ > > Signed-off-by: Kees Cook <keescook@chromium.org> > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> BTW, did this benchmark tool's results match your expectations from what you saw with your RFC? (I assume it helped since you've included in here.) -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v2 seccomp 5/6] selftests/seccomp: Compare bitmap vs filter overhead 2020-09-24 23:47 ` Kees Cook @ 2020-09-25 1:35 ` YiFei Zhu 0 siblings, 0 replies; 149+ messages in thread From: YiFei Zhu @ 2020-09-25 1:35 UTC (permalink / raw) To: Kees Cook Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Thu, Sep 24, 2020 at 6:47 PM Kees Cook <keescook@chromium.org> wrote: > BTW, did this benchmark tool's results match your expectations from what > you saw with your RFC? (I assume it helped since you've included in > here.) Yes, I updated the commit message with the benchmarks of this patch series. Though, given that I'm running in a qemu-kvm on my laptop that has a lot of stuffs running on it (and with the cursed ThinkPad T480 CPU throttling), I had to throw much more syscalls at it to pass the "approximately equals" expectation... though no idea about what's going on with 732 vs 737. Or if you mean if I expected these results, yes. YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
* [PATCH v2 seccomp 6/6] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-09-24 12:44 ` [PATCH v2 seccomp 0/6] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls YiFei Zhu ` (4 preceding siblings ...) 2020-09-24 12:44 ` [PATCH v2 seccomp 5/6] selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu @ 2020-09-24 12:44 ` YiFei Zhu 2020-09-24 23:56 ` Kees Cook 5 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-09-24 12:44 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: YiFei Zhu <yifeifz2@illinois.edu> Currently the kernel does not provide an infrastructure to translate architecture numbers to a human-readable name. Translating syscall numbers to syscall names is possible through FTRACE_SYSCALL infrastructure but it does not provide support for compat syscalls. This will create a file for each PID as /proc/pid/seccomp_cache. The file will be empty when no seccomp filters are loaded, or be in the format of: <hex arch number> <decimal syscall number> <ALLOW | FILTER> where ALLOW means the cache is guaranteed to allow the syscall, and FILTER means the cache will pass the syscall to the BPF filter. For the docker default profile on x86_64 it looks like: c000003e 0 ALLOW c000003e 1 ALLOW c000003e 2 ALLOW c000003e 3 ALLOW [...] c000003e 132 ALLOW c000003e 133 ALLOW c000003e 134 FILTER c000003e 135 FILTER c000003e 136 FILTER c000003e 137 ALLOW c000003e 138 ALLOW c000003e 139 FILTER c000003e 140 ALLOW c000003e 141 ALLOW [...] This file is guarded by CONFIG_PROC_SECCOMP_CACHE with a default of N because I think certain users of seccomp might not want the application to know which syscalls are definitely usable. 
I'm not sure if adding all the "human readable names" is worthwhile, considering it can be easily done in userspace. Suggested-by: Jann Horn <jannh@google.com> Link: https://lore.kernel.org/lkml/CAG48ez3Ofqp4crXGksLmZY6=fGrF_tWyUCg7PBkAetvbbOPeOA@mail.gmail.com/ Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> --- arch/Kconfig | 10 ++++++++++ fs/proc/base.c | 7 +++++-- include/linux/seccomp.h | 5 +++++ kernel/seccomp.c | 26 ++++++++++++++++++++++++++ 4 files changed, 46 insertions(+), 2 deletions(-) diff --git a/arch/Kconfig b/arch/Kconfig index 8cc3dc87f253..dbfd897e5dc0 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -514,6 +514,16 @@ config SECCOMP_CACHE_NR_ONLY endchoice +config PROC_SECCOMP_CACHE + bool "Show seccomp filter cache status in /proc/pid/seccomp_cache" + depends on SECCOMP_CACHE_NR_ONLY + depends on PROC_FS + help + This is enables /proc/pid/seccomp_cache interface to monitor + seccomp cache data. The file format is subject to change. + + If unsure, say N. + config HAVE_ARCH_STACKLEAK bool help diff --git a/fs/proc/base.c b/fs/proc/base.c index 617db4e0faa0..2af626f69fa1 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -2615,7 +2615,7 @@ static struct dentry *proc_pident_instantiate(struct dentry *dentry, return d_splice_alias(inode, dentry); } -static struct dentry *proc_pident_lookup(struct inode *dir, +static struct dentry *proc_pident_lookup(struct inode *dir, struct dentry *dentry, const struct pid_entry *p, const struct pid_entry *end) @@ -2815,7 +2815,7 @@ static const struct pid_entry attr_dir_stuff[] = { static int proc_attr_dir_readdir(struct file *file, struct dir_context *ctx) { - return proc_pident_readdir(file, ctx, + return proc_pident_readdir(file, ctx, attr_dir_stuff, ARRAY_SIZE(attr_dir_stuff)); } @@ -3258,6 +3258,9 @@ static const struct pid_entry tgid_base_stuff[] = { #ifdef CONFIG_PROC_PID_ARCH_STATUS ONE("arch_status", S_IRUGO, proc_pid_arch_status), #endif +#ifdef CONFIG_PROC_SECCOMP_CACHE + ONE("seccomp_cache", S_IRUSR, 
proc_pid_seccomp_cache), +#endif }; static int proc_tgid_base_readdir(struct file *file, struct dir_context *ctx) diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h index 02aef2844c38..3cedec824365 100644 --- a/include/linux/seccomp.h +++ b/include/linux/seccomp.h @@ -121,4 +121,9 @@ static inline long seccomp_get_metadata(struct task_struct *task, return -EINVAL; } #endif /* CONFIG_SECCOMP_FILTER && CONFIG_CHECKPOINT_RESTORE */ + +#ifdef CONFIG_PROC_SECCOMP_CACHE +int proc_pid_seccomp_cache(struct seq_file *m, struct pid_namespace *ns, + struct pid *pid, struct task_struct *task); +#endif #endif /* _LINUX_SECCOMP_H */ diff --git a/kernel/seccomp.c b/kernel/seccomp.c index ac0266b6d18a..d97ec1876b4e 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -2293,3 +2293,29 @@ static int __init seccomp_sysctl_init(void) device_initcall(seccomp_sysctl_init) #endif /* CONFIG_SYSCTL */ + +#ifdef CONFIG_PROC_SECCOMP_CACHE +/* Currently CONFIG_PROC_SECCOMP_CACHE implies CONFIG_SECCOMP_CACHE_NR_ONLY */ +int proc_pid_seccomp_cache(struct seq_file *m, struct pid_namespace *ns, + struct pid *pid, struct task_struct *task) +{ + struct seccomp_filter *f = READ_ONCE(task->seccomp.filter); + int arch, nr; + + if (!f) + return 0; + + for (arch = 0; arch < ARRAY_SIZE(syscall_arches); arch++) { + for (nr = 0; nr < NR_syscalls; nr++) { + bool cached = test_bit(nr, f->cache.syscall_ok[arch]); + char *status = cached ? "ALLOW" : "FILTER"; + + seq_printf(m, "%08x %d %s\n", syscall_arches[arch], + nr, status + ); + } + } + + return 0; +} +#endif /* CONFIG_PROC_SECCOMP_CACHE */ -- 2.28.0 ^ permalink raw reply related [flat|nested] 149+ messages in thread
* Re: [PATCH v2 seccomp 6/6] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-09-24 12:44 ` [PATCH v2 seccomp 6/6] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu @ 2020-09-24 23:56 ` Kees Cook 2020-09-25 3:11 ` YiFei Zhu 0 siblings, 1 reply; 149+ messages in thread From: Kees Cook @ 2020-09-24 23:56 UTC (permalink / raw) To: YiFei Zhu Cc: containers, YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Thu, Sep 24, 2020 at 07:44:21AM -0500, YiFei Zhu wrote: > From: YiFei Zhu <yifeifz2@illinois.edu> > > Currently the kernel does not provide an infrastructure to translate > architecture numbers to a human-readable name. Translating syscall > numbers to syscall names is possible through FTRACE_SYSCALL > infrastructure but it does not provide support for compat syscalls. > > This will create a file for each PID as /proc/pid/seccomp_cache. > The file will be empty when no seccomp filters are loaded, or be > in the format of: > <hex arch number> <decimal syscall number> <ALLOW | FILTER> > where ALLOW means the cache is guaranteed to allow the syscall, > and filter means the cache will pass the syscall to the BPF filter. > > For the docker default profile on x86_64 it looks like: > c000003e 0 ALLOW > c000003e 1 ALLOW > c000003e 2 ALLOW > c000003e 3 ALLOW > [...] > c000003e 132 ALLOW > c000003e 133 ALLOW > c000003e 134 FILTER > c000003e 135 FILTER > c000003e 136 FILTER > c000003e 137 ALLOW > c000003e 138 ALLOW > c000003e 139 FILTER > c000003e 140 ALLOW > c000003e 141 ALLOW > [...] > > This file is guarded by CONFIG_PROC_SECCOMP_CACHE with a default > of N because I think certain users of seecomp might not want the > application to know which syscalls are definitely usable. 
> > I'm not sure if adding all the "human readable names" is worthwhile, > considering it can be easily done in userspace. The question of permissions is my central concern here: who should see this? Some contained processes have been intentionally blocked from self-introspection so even the "standard" high bar of "ptrace attach allowed?" can't always be sufficient. My compromise about filter visibility in the past was saying that CAP_SYS_ADMIN was required (see seccomp_get_filter()). I'm nervous to weaken this. (There is some work that hasn't been sent upstream yet that is looking to expose the filter _contents_ via /proc that has been nervous too.) Now full contents vs "allow"/"filter" are certainly different things, but I don't feel like I've got enough evidence to show that this introspection would help debugging enough to justify the partially imagined safety of not exposing it to potential attackers. I suspect it _is_ the right thing to do (just look at my own RFC's "debug" patch), but I'd like this to be well justified in the commit log. And yes, while it does hide behind a CONFIG, I'd still want it justified, especially since distros have a tendency to just turn everything on anyway. ;) > + for (arch = 0; arch < ARRAY_SIZE(syscall_arches); arch++) { > + for (nr = 0; nr < NR_syscalls; nr++) { > + bool cached = test_bit(nr, f->cache.syscall_ok[arch]); > + char *status = cached ? "ALLOW" : "FILTER"; > + > + seq_printf(m, "%08x %d %s\n", syscall_arches[arch], > + nr, status > + ); > + } > + } But behavior-wise, yeah, I like it; I'm fine with human-readable and full AUDIT_ARCH values. (Though, as devil's advocate again, to repeat Jann's own words back: do we want to add this only to have a new UAPI to support going forward?) -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v2 seccomp 6/6] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-09-24 23:56 ` Kees Cook @ 2020-09-25 3:11 ` YiFei Zhu 2020-09-25 3:26 ` Kees Cook 0 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-09-25 3:11 UTC (permalink / raw) To: Kees Cook Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Thu, Sep 24, 2020 at 6:56 PM Kees Cook <keescook@chromium.org> wrote: > > This file is guarded by CONFIG_PROC_SECCOMP_CACHE with a default > The question of permissions is my central concern here: who should see > this? Some contained processes have been intentionally blocked from > self-introspection so even the "standard" high bar of "ptrace attach > allowed?" can't always be sufficient. > > My compromise about filter visibility in the past was saying that > CAP_SYS_ADMIN was required (see seccomp_get_filter()). I'm nervous to > weaken this. (There is some work that hasn't been sent upstream yet that > is looking to expose the filter _contents_ via /proc that has been > nervous too.) > > Now full contents vs "allow"/"filter" are certainly different things, > but I don't feel like I've got enough evidence to show that this > introspection would help debugging enough to justify the partially > imagined safety of not exposing it to potential attackers. Agreed. I'm inclined to make it CONFIG_DEBUG_SECCOMP_CACHE and guarded by a CAP just to make it "debug only". > I suspect it _is_ the right thing to do (just look at my own RFC's > "debug" patch), but I'd like this to be well justified in the commit > log. > > And yes, while it does hide behind a CONFIG, I'd still want it justified, > especially since distros have a tendency to just turn everything on > anyway. 
;) Is there something to stop a config from being enabled in an allyesconfig? I remember seeing something like that. Else if someone is manually selecting we can add a help text with a big banner... > But behavior-wise, yeah, I like it; I'm fine with human-readable and > full AUDIT_ARCH values. (Though, as devil's advocate again, to repeat > Jann's own words back: do we want to add this only to have a new UAPI to > support going forward?) Is this something we want to keep stable? YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v2 seccomp 6/6] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-09-25 3:11 ` YiFei Zhu @ 2020-09-25 3:26 ` Kees Cook 0 siblings, 0 replies; 149+ messages in thread From: Kees Cook @ 2020-09-25 3:26 UTC (permalink / raw) To: YiFei Zhu Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Thu, Sep 24, 2020 at 10:11:17PM -0500, YiFei Zhu wrote: > On Thu, Sep 24, 2020 at 6:56 PM Kees Cook <keescook@chromium.org> wrote: > > > This file is guarded by CONFIG_PROC_SECCOMP_CACHE with a default > > The question of permissions is my central concern here: who should see > > this? Some contained processes have been intentionally blocked from > > self-introspection so even the "standard" high bar of "ptrace attach > > allowed?" can't always be sufficient. > > > > My compromise about filter visibility in the past was saying that > > CAP_SYS_ADMIN was required (see seccomp_get_filter()). I'm nervous to > > weaken this. (There is some work that hasn't been sent upstream yet that > > is looking to expose the filter _contents_ via /proc that has been > > nervous too.) > > > > Now full contents vs "allow"/"filter" are certainly different things, > > but I don't feel like I've got enough evidence to show that this > > introspection would help debugging enough to justify the partially > > imagined safety of not exposing it to potential attackers. > > Agreed. I'm inclined to make it CONFIG_DEBUG_SECCOMP_CACHE and guarded > by a CAP just to make it "debug only". Yeah; I just can't quite see what the best direction is here. I will ponder this more. As I mentioned, it does seem handy. :) > Is there something to stop a config from being enabled in an > allyesconfig? I remember seeing something like that. 
Else if someone > is manually selecting we can add a help text with a big banner... Yeah, allyesconfig and allmodconfig both effectively set CONFIG_COMPILE_TEST. Anyway, likely a caps test will end up being the way to do it. > > > But behavior-wise, yeah, I like it; I'm fine with human-readable and > > full AUDIT_ARCH values. (Though, as devil's advocate again, to repeat > > Jann's own words back: do we want to add this only to have a new UAPI to > > support going forward?) > > Is this something we want to keep stable? The Prime Directive of "never break userspace" is really "never break userspace in a way that someone notices". So if nothing ever parses that file, then we don't have to keep it stable, but if something does, and we change it, we have to fix it. So, a capability test means very few things will touch it, and if we decide it's not a big deal, we can relax permissions in the future. -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
* [PATCH v3 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results 2020-09-21 5:35 [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls YiFei Zhu ` (7 preceding siblings ...) 2020-09-24 12:06 ` [PATCH seccomp 0/6] " YiFei Zhu @ 2020-09-30 15:19 ` YiFei Zhu 2020-09-30 15:19 ` [PATCH v3 seccomp 1/5] x86: Enable seccomp architecture tracking YiFei Zhu ` (5 more replies) 8 siblings, 6 replies; 149+ messages in thread From: YiFei Zhu @ 2020-09-30 15:19 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: YiFei Zhu <yifeifz2@illinois.edu> Alternative: https://lore.kernel.org/lkml/20200923232923.3142503-1-keescook@chromium.org/T/ Major differences from the linked alternative by Kees: * No x32 special-case handling -- not worth the complexity * No caching of denylist -- not worth the complexity * No seccomp arch pinning -- I think this is an independent feature * The bitmaps are part of the filters rather than the task. * Architectures supported by default through arch number array, except for MIPS with its sparse syscall numbers. * Configurable per-build for future different cache modes. This series adds a bitmap to cache seccomp filter results if the result permits a syscall and is independent of syscall arguments. This visibly decreases seccomp overhead for most common seccomp filters with very little memory footprint. The overhead of running Seccomp filters has been part of some past discussions [1][2][3]. Oftentimes, the filters have a large number of instructions that check syscall numbers one by one and jump based on that. Some users chain BPF filters which further enlarge the overhead. 
A recent work [6] comprehensively measures the Seccomp overhead and shows that the overhead is non-negligible and has a non-trivial impact on application performance. We observed some common filters, such as docker's [4] or systemd's [5], will make most decisions based only on the syscall numbers, and as past discussions considered, a bitmap where each bit represents a syscall makes most sense for these filters. In order to build this bitmap at filter attach time, each filter is emulated for every syscall (under each possible architecture), and checked for any accesses of struct seccomp_data that are not the "arch" nor "nr" (syscall) members. If only "arch" and "nr" are examined, and the program returns allow, then we can be sure that the filter must return allow independent from syscall arguments. When it is concluded that an allow must occur for the given architecture and syscall pair, seccomp will immediately allow the syscall, bypassing further BPF execution. Ongoing work is to further support arguments with fast hash table lookups. We are investigating the performance of doing so [6], and how to best integrate with the existing seccomp infrastructure. Some benchmarks are performed with results in patch 5, copied below: Current BPF sysctl settings: net.core.bpf_jit_enable = 1 net.core.bpf_jit_harden = 0 Benchmarking 200000000 syscalls... 
129.359381409 - 0.008724424 = 129350656985 (129.4s) getpid native: 646 ns 264.385890006 - 129.360453229 = 135025436777 (135.0s) getpid RET_ALLOW 1 filter (bitmap): 675 ns 399.400511893 - 264.387045901 = 135013465992 (135.0s) getpid RET_ALLOW 2 filters (bitmap): 675 ns 545.872866260 - 399.401718327 = 146471147933 (146.5s) getpid RET_ALLOW 3 filters (full): 732 ns 696.337101319 - 545.874097681 = 150463003638 (150.5s) getpid RET_ALLOW 4 filters (full): 752 ns Estimated total seccomp overhead for 1 bitmapped filter: 29 ns Estimated total seccomp overhead for 2 bitmapped filters: 29 ns Estimated total seccomp overhead for 3 full filters: 86 ns Estimated total seccomp overhead for 4 full filters: 106 ns Estimated seccomp entry overhead: 29 ns Estimated seccomp per-filter overhead (last 2 diff): 20 ns Estimated seccomp per-filter overhead (filters / 4): 19 ns Expectations: native ≤ 1 bitmap (646 ≤ 675): ✔️ native ≤ 1 filter (646 ≤ 732): ✔️ per-filter (last 2 diff) ≈ per-filter (filters / 4) (20 ≈ 19): ✔️ 1 bitmapped ≈ 2 bitmapped (29 ≈ 29): ✔️ entry ≈ 1 bitmapped (29 ≈ 29): ✔️ entry ≈ 2 bitmapped (29 ≈ 29): ✔️ native + entry + (per filter * 4) ≈ 4 filters total (755 ≈ 752): ✔️ v2 -> v3: * Added array_index_nospec guards * No more syscall_arches[] array and expecting on loop unrolling. Arches are configured with per-arch seccomp.h. * Moved filter emulation to attach time (from prepare time). * Further simplified emulator, basing on Kees's code. * Guard /proc/pid/seccomp_cache with CAP_SYS_ADMIN. v1 -> v2: * Corrected one outdated function documentation. RFC -> v1: * Config made on by default across all arches that could support it. * Added arch numbers array and emulate filter for each arch number, and have a per-arch bitmap. * Massively simplified the emulator so it would only support the common instructions in Kees's list. * Fixed inheriting bitmap across filters (filter->prev is always NULL during prepare). * Stole the selftest from Kees. 
* Added a /proc/pid/seccomp_cache by Jann's suggestion. Patch 1 adds the arch macros for x86. Patch 2 implements the emulator that finds if a filter must return allow, Patch 3 implements the test_bit against the bitmaps. Patch 4 updates the selftest to better show the new semantics. Patch 5 implements /proc/pid/seccomp_cache. [1] https://lore.kernel.org/linux-security-module/c22a6c3cefc2412cad00ae14c1371711@huawei.com/T/ [2] https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/T/ [3] https://github.com/seccomp/libseccomp/issues/116 [4] https://github.com/moby/moby/blob/ae0ef82b90356ac613f329a8ef5ee42ca923417d/profiles/seccomp/default.json [5] https://github.com/systemd/systemd/blob/6743a1caf4037f03dc51a1277855018e4ab61957/src/shared/seccomp-util.c#L270 [6] Draco: Architectural and Operating System Support for System Call Security https://tianyin.github.io/pub/draco.pdf, MICRO-53, Oct. 2020 Kees Cook (2): x86: Enable seccomp architecture tracking selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu (3): seccomp/cache: Add "emulator" to check if filter is constant allow seccomp/cache: Lookup syscall allowlist for fast path seccomp/cache: Report cache data through /proc/pid/seccomp_cache arch/Kconfig | 49 ++++ arch/x86/Kconfig | 1 + arch/x86/include/asm/seccomp.h | 15 + fs/proc/base.c | 3 + include/linux/seccomp.h | 5 + kernel/seccomp.c | 265 +++++++++++++++++- .../selftests/seccomp/seccomp_benchmark.c | 151 ++++++++-- tools/testing/selftests/seccomp/settings | 2 +- 8 files changed, 467 insertions(+), 24 deletions(-) -- 2.28.0 ^ permalink raw reply [flat|nested] 149+ messages in thread
* [PATCH v3 seccomp 1/5] x86: Enable seccomp architecture tracking 2020-09-30 15:19 ` [PATCH v3 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu @ 2020-09-30 15:19 ` YiFei Zhu 2020-09-30 21:21 ` Kees Cook 2020-09-30 15:19 ` [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow YiFei Zhu ` (4 subsequent siblings) 5 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-09-30 15:19 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: Kees Cook <keescook@chromium.org> Provide seccomp internals with the details to calculate which syscall table the running kernel is expecting to deal with. This allows for efficient architecture pinning and paves the way for constant-action bitmaps. 
Signed-off-by: Kees Cook <keescook@chromium.org> [YiFei: Removed x32, added macro for nr_syscalls] Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> --- arch/x86/include/asm/seccomp.h | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/arch/x86/include/asm/seccomp.h b/arch/x86/include/asm/seccomp.h index 2bd1338de236..7b3a58271656 100644 --- a/arch/x86/include/asm/seccomp.h +++ b/arch/x86/include/asm/seccomp.h @@ -16,6 +16,18 @@ #define __NR_seccomp_sigreturn_32 __NR_ia32_sigreturn #endif +#ifdef CONFIG_X86_64 +# define SECCOMP_ARCH_DEFAULT AUDIT_ARCH_X86_64 +# define SECCOMP_ARCH_DEFAULT_NR NR_syscalls +# ifdef CONFIG_COMPAT +# define SECCOMP_ARCH_COMPAT AUDIT_ARCH_I386 +# define SECCOMP_ARCH_COMPAT_NR IA32_NR_syscalls +# endif +#else /* !CONFIG_X86_64 */ +# define SECCOMP_ARCH_DEFAULT AUDIT_ARCH_I386 +# define SECCOMP_ARCH_DEFAULT_NR NR_syscalls +#endif + #include <asm-generic/seccomp.h> #endif /* _ASM_X86_SECCOMP_H */ -- 2.28.0 ^ permalink raw reply related [flat|nested] 149+ messages in thread
* Re: [PATCH v3 seccomp 1/5] x86: Enable seccomp architecture tracking 2020-09-30 15:19 ` [PATCH v3 seccomp 1/5] x86: Enable seccomp architecture tracking YiFei Zhu @ 2020-09-30 21:21 ` Kees Cook 2020-09-30 21:33 ` Jann Horn 0 siblings, 1 reply; 149+ messages in thread From: Kees Cook @ 2020-09-30 21:21 UTC (permalink / raw) To: YiFei Zhu Cc: containers, YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Wed, Sep 30, 2020 at 10:19:12AM -0500, YiFei Zhu wrote: > From: Kees Cook <keescook@chromium.org> > > Provide seccomp internals with the details to calculate which syscall > table the running kernel is expecting to deal with. This allows for > efficient architecture pinning and paves the way for constant-action > bitmaps. > > Signed-off-by: Kees Cook <keescook@chromium.org> > [YiFei: Removed x32, added macro for nr_syscalls] > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> > --- > arch/x86/include/asm/seccomp.h | 12 ++++++++++++ > 1 file changed, 12 insertions(+) > > diff --git a/arch/x86/include/asm/seccomp.h b/arch/x86/include/asm/seccomp.h > index 2bd1338de236..7b3a58271656 100644 > --- a/arch/x86/include/asm/seccomp.h > +++ b/arch/x86/include/asm/seccomp.h > @@ -16,6 +16,18 @@ > #define __NR_seccomp_sigreturn_32 __NR_ia32_sigreturn > #endif > > +#ifdef CONFIG_X86_64 > +# define SECCOMP_ARCH_DEFAULT AUDIT_ARCH_X86_64 > +# define SECCOMP_ARCH_DEFAULT_NR NR_syscalls bikeshedding: let's call these SECCOMP_ARCH_NATIVE* -- I think it's more descriptive. 
> +# ifdef CONFIG_COMPAT > +# define SECCOMP_ARCH_COMPAT AUDIT_ARCH_I386 > +# define SECCOMP_ARCH_COMPAT_NR IA32_NR_syscalls > +# endif > +#else /* !CONFIG_X86_64 */ > +# define SECCOMP_ARCH_DEFAULT AUDIT_ARCH_I386 > +# define SECCOMP_ARCH_DEFAULT_NR NR_syscalls > +#endif > + > #include <asm-generic/seccomp.h> > > #endif /* _ASM_X86_SECCOMP_H */ > -- > 2.28.0 > But otherwise, yes, looks good to me. For this patch, I think the S-o-b chain is probably more accurately captured as: Signed-off-by: Kees Cook <keescook@chromium.org> Co-developed-by: YiFei Zhu <yifeifz2@illinois.edu> Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v3 seccomp 1/5] x86: Enable seccomp architecture tracking 2020-09-30 21:21 ` Kees Cook @ 2020-09-30 21:33 ` Jann Horn 2020-09-30 22:53 ` Kees Cook 0 siblings, 1 reply; 149+ messages in thread From: Jann Horn @ 2020-09-30 21:33 UTC (permalink / raw) To: Kees Cook Cc: YiFei Zhu, Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Wed, Sep 30, 2020 at 11:21 PM Kees Cook <keescook@chromium.org> wrote: > On Wed, Sep 30, 2020 at 10:19:12AM -0500, YiFei Zhu wrote: > > From: Kees Cook <keescook@chromium.org> > > > > Provide seccomp internals with the details to calculate which syscall > > table the running kernel is expecting to deal with. This allows for > > efficient architecture pinning and paves the way for constant-action > > bitmaps. > > > > Signed-off-by: Kees Cook <keescook@chromium.org> > > [YiFei: Removed x32, added macro for nr_syscalls] > > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> [...] > But otherwise, yes, looks good to me. For this patch, I think the S-o-b chain is probably more > accurately captured as: > > Signed-off-by: Kees Cook <keescook@chromium.org> > Co-developed-by: YiFei Zhu <yifeifz2@illinois.edu> > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> (Technically, https://www.kernel.org/doc/html/latest/process/submitting-patches.html#when-to-use-acked-by-cc-and-co-developed-by says that "every Co-developed-by: must be immediately followed by a Signed-off-by: of the associated co-author" (and has an example of how that should look).) ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v3 seccomp 1/5] x86: Enable seccomp architecture tracking 2020-09-30 21:33 ` Jann Horn @ 2020-09-30 22:53 ` Kees Cook 2020-09-30 23:15 ` Jann Horn 0 siblings, 1 reply; 149+ messages in thread From: Kees Cook @ 2020-09-30 22:53 UTC (permalink / raw) To: Jann Horn Cc: YiFei Zhu, Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Wed, Sep 30, 2020 at 11:33:15PM +0200, Jann Horn wrote: > On Wed, Sep 30, 2020 at 11:21 PM Kees Cook <keescook@chromium.org> wrote: > > On Wed, Sep 30, 2020 at 10:19:12AM -0500, YiFei Zhu wrote: > > > From: Kees Cook <keescook@chromium.org> > > > > > > Provide seccomp internals with the details to calculate which syscall > > > table the running kernel is expecting to deal with. This allows for > > > efficient architecture pinning and paves the way for constant-action > > > bitmaps. > > > > > > Signed-off-by: Kees Cook <keescook@chromium.org> > > > [YiFei: Removed x32, added macro for nr_syscalls] > > > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> > [...] > > But otherwise, yes, looks good to me. For this patch, I think the S-o-b chain is probably more > > accurately captured as: > > > > Signed-off-by: Kees Cook <keescook@chromium.org> > > Co-developed-by: YiFei Zhu <yifeifz2@illinois.edu> > > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> > > (Technically, https://www.kernel.org/doc/html/latest/process/submitting-patches.html#when-to-use-acked-by-cc-and-co-developed-by > says that "every Co-developed-by: must be immediately followed by a > Signed-off-by: of the associated co-author" (and has an example of how > that should look).) 
Right, but it is not needed for the commit author (here, the From:), the second example given in the docs shows this: From: From Author <from@author.example.org> <changelog> Co-developed-by: Random Co-Author <random@coauthor.example.org> Signed-off-by: Random Co-Author <random@coauthor.example.org> Signed-off-by: From Author <from@author.example.org> Co-developed-by: Submitting Co-Author <sub@coauthor.example.org> Signed-off-by: Submitting Co-Author <sub@coauthor.example.org> and there is no third co-developer, so it's: From: From Author <from@author.example.org> <changelog> Signed-off-by: From Author <from@author.example.org> Co-developed-by: Submitting Co-Author <sub@coauthor.example.org> Signed-off-by: Submitting Co-Author <sub@coauthor.example.org> If I'm the From, and YiFei Zhu is the submitting co-developer, then it's: From: Kees Cook <keescook@chromium.org> <changelog> Signed-off-by: Kees Cook <keescook@chromium.org> Co-developed-by: YiFei Zhu <yifeifz2@illinois.edu> Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> which is what I suggested. -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v3 seccomp 1/5] x86: Enable seccomp architecture tracking 2020-09-30 22:53 ` Kees Cook @ 2020-09-30 23:15 ` Jann Horn 0 siblings, 0 replies; 149+ messages in thread From: Jann Horn @ 2020-09-30 23:15 UTC (permalink / raw) To: Kees Cook Cc: YiFei Zhu, Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Thu, Oct 1, 2020 at 12:53 AM Kees Cook <keescook@chromium.org> wrote: > > On Wed, Sep 30, 2020 at 11:33:15PM +0200, Jann Horn wrote: > > On Wed, Sep 30, 2020 at 11:21 PM Kees Cook <keescook@chromium.org> wrote: > > > On Wed, Sep 30, 2020 at 10:19:12AM -0500, YiFei Zhu wrote: > > > > From: Kees Cook <keescook@chromium.org> > > > > > > > > Provide seccomp internals with the details to calculate which syscall > > > > table the running kernel is expecting to deal with. This allows for > > > > efficient architecture pinning and paves the way for constant-action > > > > bitmaps. > > > > > > > > Signed-off-by: Kees Cook <keescook@chromium.org> > > > > [YiFei: Removed x32, added macro for nr_syscalls] > > > > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> > > [...] > > > But otherwise, yes, looks good to me. For this patch, I think the S-o-b chain is probably more > > > accurately captured as: > > > > > > Signed-off-by: Kees Cook <keescook@chromium.org> > > > Co-developed-by: YiFei Zhu <yifeifz2@illinois.edu> > > > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> > > > > (Technically, https://www.kernel.org/doc/html/latest/process/submitting-patches.html#when-to-use-acked-by-cc-and-co-developed-by > > says that "every Co-developed-by: must be immediately followed by a > > Signed-off-by: of the associated co-author" (and has an example of how > > that should look).) 
> > Right, but it is not needed for the commit author (here, the From:), > the second example given in the docs shows this: Aah, right. Nevermind, sorry for the noise. ^ permalink raw reply [flat|nested] 149+ messages in thread
* [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow 2020-09-30 15:19 ` [PATCH v3 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu 2020-09-30 15:19 ` [PATCH v3 seccomp 1/5] x86: Enable seccomp architecture tracking YiFei Zhu @ 2020-09-30 15:19 ` YiFei Zhu 2020-09-30 22:24 ` Jann Horn ` (2 more replies) 2020-09-30 15:19 ` [PATCH v3 seccomp 3/5] seccomp/cache: Lookup syscall allowlist for fast path YiFei Zhu ` (3 subsequent siblings) 5 siblings, 3 replies; 149+ messages in thread From: YiFei Zhu @ 2020-09-30 15:19 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: YiFei Zhu <yifeifz2@illinois.edu> SECCOMP_CACHE_NR_ONLY will only operate on syscalls that do not access any syscall arguments or instruction pointer. To facilitate this we need a static analyser to know whether a filter will return allow regardless of syscall arguments for a given architecture number / syscall number pair. This is implemented here with a pseudo-emulator, and stored in a per-filter bitmap. Each common BPF instruction is emulated. Any weirdness or loading from a syscall argument will cause the emulator to bail. The emulation is also halted if it reaches a return. In that case, if it returns an SECCOMP_RET_ALLOW, the syscall is marked as good. Emulator structure and comments are from Kees [1] and Jann [2]. Emulation is done at attach time. If a filter depends on previous filters, and a previous filter does not guarantee that the syscall is allowed, we skip the emulation of this syscall. 
[1] https://lore.kernel.org/lkml/20200923232923.3142503-5-keescook@chromium.org/ [2] https://lore.kernel.org/lkml/CAG48ez1p=dR_2ikKq=xVxkoGg0fYpTBpkhJSv1w-6BG=76PAvw@mail.gmail.com/ Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> --- arch/Kconfig | 34 ++++++++++ arch/x86/Kconfig | 1 + kernel/seccomp.c | 167 ++++++++++++++++++++++++++++++++++++++++++++++- 3 files changed, 201 insertions(+), 1 deletion(-) diff --git a/arch/Kconfig b/arch/Kconfig index 21a3675a7a3a..ca867b2a5d71 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -471,6 +471,14 @@ config HAVE_ARCH_SECCOMP_FILTER results in the system call being skipped immediately. - seccomp syscall wired up +config HAVE_ARCH_SECCOMP_CACHE_NR_ONLY + bool + help + An arch should select this symbol if it provides all of these things: + - all the requirements for HAVE_ARCH_SECCOMP_FILTER + - SECCOMP_ARCH_DEFAULT + - SECCOMP_ARCH_DEFAULT_NR + config SECCOMP prompt "Enable seccomp to safely execute untrusted bytecode" def_bool y @@ -498,6 +506,32 @@ config SECCOMP_FILTER See Documentation/userspace-api/seccomp_filter.rst for details. +choice + prompt "Seccomp filter cache" + default SECCOMP_CACHE_NONE + depends on SECCOMP_FILTER + depends on HAVE_ARCH_SECCOMP_CACHE_NR_ONLY + help + Seccomp filters can potentially incur large overhead for each + system call. This can alleviate some of the overhead. + + If in doubt, select 'syscall numbers only'. + +config SECCOMP_CACHE_NONE + bool "None" + help + No caching is done. Seccomp filters will be called each time + a system call occurs in a seccomp-guarded task. + +config SECCOMP_CACHE_NR_ONLY + bool "Syscall number only" + depends on HAVE_ARCH_SECCOMP_CACHE_NR_ONLY + help + For each syscall number, if the seccomp filter has a fixed + result, store that result in a bitmap to speed up system calls. 
+ +endchoice + config HAVE_ARCH_STACKLEAK bool help diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 1ab22869a765..ff5289228ea5 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -150,6 +150,7 @@ config X86 select HAVE_ARCH_COMPAT_MMAP_BASES if MMU && COMPAT select HAVE_ARCH_PREL32_RELOCATIONS select HAVE_ARCH_SECCOMP_FILTER + select HAVE_ARCH_SECCOMP_CACHE_NR_ONLY select HAVE_ARCH_THREAD_STRUCT_WHITELIST select HAVE_ARCH_STACKLEAK select HAVE_ARCH_TRACEHOOK diff --git a/kernel/seccomp.c b/kernel/seccomp.c index ae6b40cc39f4..f09c9e74ae05 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -143,6 +143,37 @@ struct notification { struct list_head notifications; }; +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY +/** + * struct seccomp_cache_filter_data - container for cache's per-filter data + * + * Tis struct is ordered to minimize padding holes. + * + * @syscall_allow_default: A bitmap where each bit represents whether the + * filter willalways allow the syscall, for the + * default architecture. + * @syscall_allow_compat: A bitmap where each bit represents whether the + * filter will always allow the syscall, for the + * compat architecture. + */ +struct seccomp_cache_filter_data { +#ifdef SECCOMP_ARCH_DEFAULT + DECLARE_BITMAP(syscall_allow_default, SECCOMP_ARCH_DEFAULT_NR); +#endif +#ifdef SECCOMP_ARCH_COMPAT + DECLARE_BITMAP(syscall_allow_compat, SECCOMP_ARCH_COMPAT_NR); +#endif +}; + +#define SECCOMP_EMU_MAX_PENDING_STATES 64 +#else +struct seccomp_cache_filter_data { }; + +static inline void seccomp_cache_prepare(struct seccomp_filter *sfilter) +{ +} +#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */ + /** * struct seccomp_filter - container for seccomp BPF programs * @@ -159,6 +190,7 @@ struct notification { * this filter after reaching 0. The @users count is always smaller * or equal to @refs. Hence, reaching 0 for @users does not mean * the filter can be freed. + * @cache: container for cache-related data. 
* @log: true if all actions except for SECCOMP_RET_ALLOW should be logged * @prev: points to a previously installed, or inherited, filter * @prog: the BPF program to evaluate @@ -180,6 +212,7 @@ struct seccomp_filter { refcount_t refs; refcount_t users; bool log; + struct seccomp_cache_filter_data cache; struct seccomp_filter *prev; struct bpf_prog *prog; struct notification *notif; @@ -544,7 +577,8 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog) { struct seccomp_filter *sfilter; int ret; - const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE); + const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE) || + IS_ENABLED(CONFIG_SECCOMP_CACHE_NR_ONLY); if (fprog->len == 0 || fprog->len > BPF_MAXINSNS) return ERR_PTR(-EINVAL); @@ -610,6 +644,136 @@ seccomp_prepare_user_filter(const char __user *user_filter) return filter; } +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY +/** + * seccomp_emu_is_const_allow - check if filter is constant allow with given data + * @fprog: The BPF programs + * @sd: The seccomp data to check against, only syscall number are arch + * number are considered constant. 
+ */ +static bool seccomp_emu_is_const_allow(struct sock_fprog_kern *fprog, + struct seccomp_data *sd) +{ + unsigned int insns; + unsigned int reg_value = 0; + unsigned int pc; + bool op_res; + + if (WARN_ON_ONCE(!fprog)) + return false; + + insns = bpf_classic_proglen(fprog); + for (pc = 0; pc < insns; pc++) { + struct sock_filter *insn = &fprog->filter[pc]; + u16 code = insn->code; + u32 k = insn->k; + + switch (code) { + case BPF_LD | BPF_W | BPF_ABS: + switch (k) { + case offsetof(struct seccomp_data, nr): + reg_value = sd->nr; + break; + case offsetof(struct seccomp_data, arch): + reg_value = sd->arch; + break; + default: + /* can't optimize (non-constant value load) */ + return false; + } + break; + case BPF_RET | BPF_K: + /* reached return with constant values only, check allow */ + return k == SECCOMP_RET_ALLOW; + case BPF_JMP | BPF_JA: + pc += insn->k; + break; + case BPF_JMP | BPF_JEQ | BPF_K: + case BPF_JMP | BPF_JGE | BPF_K: + case BPF_JMP | BPF_JGT | BPF_K: + case BPF_JMP | BPF_JSET | BPF_K: + switch (BPF_OP(code)) { + case BPF_JEQ: + op_res = reg_value == k; + break; + case BPF_JGE: + op_res = reg_value >= k; + break; + case BPF_JGT: + op_res = reg_value > k; + break; + case BPF_JSET: + op_res = !!(reg_value & k); + break; + default: + /* can't optimize (unknown jump) */ + return false; + } + + pc += op_res ? insn->jt : insn->jf; + break; + case BPF_ALU | BPF_AND | BPF_K: + reg_value &= k; + break; + default: + /* can't optimize (unknown insn) */ + return false; + } + } + + /* ran off the end of the filter?! 
*/ + WARN_ON(1); + return false; +} + +static void seccomp_cache_prepare_bitmap(struct seccomp_filter *sfilter, + void *bitmap, const void *bitmap_prev, + size_t bitmap_size, int arch) +{ + struct sock_fprog_kern *fprog = sfilter->prog->orig_prog; + struct seccomp_data sd; + int nr; + + for (nr = 0; nr < bitmap_size; nr++) { + if (bitmap_prev && !test_bit(nr, bitmap_prev)) + continue; + + sd.nr = nr; + sd.arch = arch; + + if (seccomp_emu_is_const_allow(fprog, &sd)) + set_bit(nr, bitmap); + } +} + +/** + * seccomp_cache_prepare - emulate the filter to find cachable syscalls + * @sfilter: The seccomp filter + * + * Returns 0 if successful or -errno if error occurred. + */ +static void seccomp_cache_prepare(struct seccomp_filter *sfilter) +{ + struct seccomp_cache_filter_data *cache = &sfilter->cache; + const struct seccomp_cache_filter_data *cache_prev = + sfilter->prev ? &sfilter->prev->cache : NULL; + +#ifdef SECCOMP_ARCH_DEFAULT + seccomp_cache_prepare_bitmap(sfilter, cache->syscall_allow_default, + cache_prev ? cache_prev->syscall_allow_default : NULL, + SECCOMP_ARCH_DEFAULT_NR, + SECCOMP_ARCH_DEFAULT); +#endif /* SECCOMP_ARCH_DEFAULT */ + +#ifdef SECCOMP_ARCH_COMPAT + seccomp_cache_prepare_bitmap(sfilter, cache->syscall_allow_compat, + cache_prev ? cache_prev->syscall_allow_compat : NULL, + SECCOMP_ARCH_COMPAT_NR, + SECCOMP_ARCH_COMPAT); +#endif /* SECCOMP_ARCH_COMPAT */ +} +#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */ + /** * seccomp_attach_filter: validate and attach filter * @flags: flags to change filter behavior @@ -659,6 +823,7 @@ static long seccomp_attach_filter(unsigned int flags, * task reference. */ filter->prev = current->seccomp.filter; + seccomp_cache_prepare(filter); current->seccomp.filter = filter; atomic_inc(¤t->seccomp.filter_count); -- 2.28.0 ^ permalink raw reply related [flat|nested] 149+ messages in thread
* Re: [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow 2020-09-30 15:19 ` [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow YiFei Zhu @ 2020-09-30 22:24 ` Jann Horn 2020-09-30 22:49 ` Kees Cook 2020-10-01 11:28 ` YiFei Zhu 2020-09-30 22:40 ` Kees Cook 2020-10-09 4:47 ` YiFei Zhu 2 siblings, 2 replies; 149+ messages in thread From: Jann Horn @ 2020-09-30 22:24 UTC (permalink / raw) To: YiFei Zhu Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Wed, Sep 30, 2020 at 5:20 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > SECCOMP_CACHE_NR_ONLY will only operate on syscalls that do not > access any syscall arguments or instruction pointer. To facilitate > this we need a static analyser to know whether a filter will > return allow regardless of syscall arguments for a given > architecture number / syscall number pair. This is implemented > here with a pseudo-emulator, and stored in a per-filter bitmap. > > Each common BPF instruction are emulated. Any weirdness or loading > from a syscall argument will cause the emulator to bail. > > The emulation is also halted if it reaches a return. In that case, > if it returns an SECCOMP_RET_ALLOW, the syscall is marked as good. > > Emulator structure and comments are from Kees [1] and Jann [2]. > > Emulation is done at attach time. If a filter depends on more > filters, and if the dependee does not guarantee to allow the > syscall, then we skip the emulation of this syscall. > > [1] https://lore.kernel.org/lkml/20200923232923.3142503-5-keescook@chromium.org/ > [2] https://lore.kernel.org/lkml/CAG48ez1p=dR_2ikKq=xVxkoGg0fYpTBpkhJSv1w-6BG=76PAvw@mail.gmail.com/ [...] 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig > index 1ab22869a765..ff5289228ea5 100644 > --- a/arch/x86/Kconfig > +++ b/arch/x86/Kconfig > @@ -150,6 +150,7 @@ config X86 > select HAVE_ARCH_COMPAT_MMAP_BASES if MMU && COMPAT > select HAVE_ARCH_PREL32_RELOCATIONS > select HAVE_ARCH_SECCOMP_FILTER > + select HAVE_ARCH_SECCOMP_CACHE_NR_ONLY > select HAVE_ARCH_THREAD_STRUCT_WHITELIST > select HAVE_ARCH_STACKLEAK > select HAVE_ARCH_TRACEHOOK If you did the architecture enablement for X86 later in the series, you could move this part over into that patch, that'd be cleaner. > diff --git a/kernel/seccomp.c b/kernel/seccomp.c > index ae6b40cc39f4..f09c9e74ae05 100644 > --- a/kernel/seccomp.c > +++ b/kernel/seccomp.c > @@ -143,6 +143,37 @@ struct notification { > struct list_head notifications; > }; > > +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY > +/** > + * struct seccomp_cache_filter_data - container for cache's per-filter data > + * > + * Tis struct is ordered to minimize padding holes. I think this comment can probably go away, there isn't really much trickery around padding holes in the struct as it is now. > + * @syscall_allow_default: A bitmap where each bit represents whether the > + * filter willalways allow the syscall, for the nit: s/willalways/will always/ [...] > +static void seccomp_cache_prepare_bitmap(struct seccomp_filter *sfilter, > + void *bitmap, const void *bitmap_prev, > + size_t bitmap_size, int arch) > +{ > + struct sock_fprog_kern *fprog = sfilter->prog->orig_prog; > + struct seccomp_data sd; > + int nr; > + > + for (nr = 0; nr < bitmap_size; nr++) { > + if (bitmap_prev && !test_bit(nr, bitmap_prev)) > + continue; > + > + sd.nr = nr; > + sd.arch = arch; > + > + if (seccomp_emu_is_const_allow(fprog, &sd)) > + set_bit(nr, bitmap); set_bit() is atomic, but since we only do this at filter setup, before the filter becomes globally visible, we don't need atomicity here. So this should probably use __set_bit() instead. 
^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow 2020-09-30 22:24 ` Jann Horn @ 2020-09-30 22:49 ` Kees Cook 2020-10-01 11:28 ` YiFei Zhu 1 sibling, 0 replies; 149+ messages in thread From: Kees Cook @ 2020-09-30 22:49 UTC (permalink / raw) To: Jann Horn Cc: YiFei Zhu, Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Thu, Oct 01, 2020 at 12:24:32AM +0200, Jann Horn wrote: > On Wed, Sep 30, 2020 at 5:20 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > > SECCOMP_CACHE_NR_ONLY will only operate on syscalls that do not > > access any syscall arguments or instruction pointer. To facilitate > > this we need a static analyser to know whether a filter will > > return allow regardless of syscall arguments for a given > > architecture number / syscall number pair. This is implemented > > here with a pseudo-emulator, and stored in a per-filter bitmap. > > > > Each common BPF instruction are emulated. Any weirdness or loading > > from a syscall argument will cause the emulator to bail. > > > > The emulation is also halted if it reaches a return. In that case, > > if it returns an SECCOMP_RET_ALLOW, the syscall is marked as good. > > > > Emulator structure and comments are from Kees [1] and Jann [2]. > > > > Emulation is done at attach time. If a filter depends on more > > filters, and if the dependee does not guarantee to allow the > > syscall, then we skip the emulation of this syscall. > > > > [1] https://lore.kernel.org/lkml/20200923232923.3142503-5-keescook@chromium.org/ > > [2] https://lore.kernel.org/lkml/CAG48ez1p=dR_2ikKq=xVxkoGg0fYpTBpkhJSv1w-6BG=76PAvw@mail.gmail.com/ > [...] 
> > +static void seccomp_cache_prepare_bitmap(struct seccomp_filter *sfilter, > > + void *bitmap, const void *bitmap_prev, > > + size_t bitmap_size, int arch) > > +{ > > + struct sock_fprog_kern *fprog = sfilter->prog->orig_prog; > > + struct seccomp_data sd; > > + int nr; > > + > > + for (nr = 0; nr < bitmap_size; nr++) { > > + if (bitmap_prev && !test_bit(nr, bitmap_prev)) > > + continue; > > + > > + sd.nr = nr; > > + sd.arch = arch; > > + > > + if (seccomp_emu_is_const_allow(fprog, &sd)) > > + set_bit(nr, bitmap); > > set_bit() is atomic, but since we only do this at filter setup, before > the filter becomes globally visible, we don't need atomicity here. So > this should probably use __set_bit() instead. Oh yes, excellent point! That will speed this up a bit. When you do this, please include a comment here describing why its safe to do it non-atomic. :) -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow 2020-09-30 22:24 ` Jann Horn 2020-09-30 22:49 ` Kees Cook @ 2020-10-01 11:28 ` YiFei Zhu 2020-10-01 21:08 ` Jann Horn 1 sibling, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-10-01 11:28 UTC (permalink / raw) To: Jann Horn Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Wed, Sep 30, 2020 at 5:24 PM Jann Horn <jannh@google.com> wrote: > If you did the architecture enablement for X86 later in the series, > you could move this part over into that patch, that'd be cleaner. As in, patch 1: bitmap check logic. patch 2: emulator. patch 3: enable for x86? > > + * Tis struct is ordered to minimize padding holes. > > I think this comment can probably go away, there isn't really much > trickery around padding holes in the struct as it is now. Oh right, I was trying the locks and adding bits to indicate if certain arches are primed, then I undid that. > > + set_bit(nr, bitmap); > > set_bit() is atomic, but since we only do this at filter setup, before > the filter becomes globally visible, we don't need atomicity here. So > this should probably use __set_bit() instead. Right YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow 2020-10-01 11:28 ` YiFei Zhu @ 2020-10-01 21:08 ` Jann Horn 0 siblings, 0 replies; 149+ messages in thread From: Jann Horn @ 2020-10-01 21:08 UTC (permalink / raw) To: YiFei Zhu Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Thu, Oct 1, 2020 at 1:28 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > On Wed, Sep 30, 2020 at 5:24 PM Jann Horn <jannh@google.com> wrote: > > If you did the architecture enablement for X86 later in the series, > > you could move this part over into that patch, that'd be cleaner. > > As in, patch 1: bitmap check logic. patch 2: emulator. patch 3: enable for x86? Yeah. ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow 2020-09-30 15:19 ` [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow YiFei Zhu 2020-09-30 22:24 ` Jann Horn @ 2020-09-30 22:40 ` Kees Cook 2020-10-01 11:52 ` YiFei Zhu 2020-10-09 4:47 ` YiFei Zhu 2 siblings, 1 reply; 149+ messages in thread From: Kees Cook @ 2020-09-30 22:40 UTC (permalink / raw) To: YiFei Zhu, Jann Horn Cc: containers, YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Wed, Sep 30, 2020 at 10:19:13AM -0500, YiFei Zhu wrote: > From: YiFei Zhu <yifeifz2@illinois.edu> > > SECCOMP_CACHE_NR_ONLY will only operate on syscalls that do not > access any syscall arguments or instruction pointer. To facilitate > this we need a static analyser to know whether a filter will > return allow regardless of syscall arguments for a given > architecture number / syscall number pair. This is implemented > here with a pseudo-emulator, and stored in a per-filter bitmap. > > Each common BPF instruction are emulated. Any weirdness or loading > from a syscall argument will cause the emulator to bail. > > The emulation is also halted if it reaches a return. In that case, > if it returns an SECCOMP_RET_ALLOW, the syscall is marked as good. > > Emulator structure and comments are from Kees [1] and Jann [2]. > > Emulation is done at attach time. If a filter depends on more > filters, and if the dependee does not guarantee to allow the > syscall, then we skip the emulation of this syscall. 
> > [1] https://lore.kernel.org/lkml/20200923232923.3142503-5-keescook@chromium.org/ > [2] https://lore.kernel.org/lkml/CAG48ez1p=dR_2ikKq=xVxkoGg0fYpTBpkhJSv1w-6BG=76PAvw@mail.gmail.com/ > > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> See comments on patch 3 for reorganizing this a bit for the next version. For the infrastructure patch, I'd like to see much of the cover letter in the commit log (otherwise those details are harder for people to find). That will describe the _why_ for preparing this change, etc. For the emulator patch, I'd like to see the discussion about how the subset of BFP instructions was selected, what libraries Jann and I examined, etc. (For all of these commit logs, I try to pretend that whoever is reading it has not followed any lkml thread of discussion, etc.) > --- > arch/Kconfig | 34 ++++++++++ > arch/x86/Kconfig | 1 + > kernel/seccomp.c | 167 ++++++++++++++++++++++++++++++++++++++++++++++- > 3 files changed, 201 insertions(+), 1 deletion(-) > > diff --git a/arch/Kconfig b/arch/Kconfig > index 21a3675a7a3a..ca867b2a5d71 100644 > --- a/arch/Kconfig > +++ b/arch/Kconfig > @@ -471,6 +471,14 @@ config HAVE_ARCH_SECCOMP_FILTER > results in the system call being skipped immediately. > - seccomp syscall wired up > > +config HAVE_ARCH_SECCOMP_CACHE_NR_ONLY > + bool > + help > + An arch should select this symbol if it provides all of these things: > + - all the requirements for HAVE_ARCH_SECCOMP_FILTER > + - SECCOMP_ARCH_DEFAULT > + - SECCOMP_ARCH_DEFAULT_NR > + There's no need for this config and the per-arch Kconfig clutter: SECCOMP_ARCH_NATIVE will be a sufficient gate. > config SECCOMP > prompt "Enable seccomp to safely execute untrusted bytecode" > def_bool y > @@ -498,6 +506,32 @@ config SECCOMP_FILTER > > See Documentation/userspace-api/seccomp_filter.rst for details. 
> > +choice > + prompt "Seccomp filter cache" > + default SECCOMP_CACHE_NONE > + depends on SECCOMP_FILTER > + depends on HAVE_ARCH_SECCOMP_CACHE_NR_ONLY > + help > + Seccomp filters can potentially incur large overhead for each > + system call. This can alleviate some of the overhead. > + > + If in doubt, select 'syscall numbers only'. > + > +config SECCOMP_CACHE_NONE > + bool "None" > + help > + No caching is done. Seccomp filters will be called each time > + a system call occurs in a seccomp-guarded task. > + > +config SECCOMP_CACHE_NR_ONLY > + bool "Syscall number only" > + depends on HAVE_ARCH_SECCOMP_CACHE_NR_ONLY > + help > + For each syscall number, if the seccomp filter has a fixed > + result, store that result in a bitmap to speed up system calls. > + > +endchoice I don't want this config: there is only 1 caching mechanism happening in this series and I do not want to have it buildable as "off": it should be available for all supported architectures. When further caching methods happen, the config can be introduced then (though I'll likely argue it should then be a boot param to allow distro kernels to make it selectable). 
> + > config HAVE_ARCH_STACKLEAK > bool > help > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig > index 1ab22869a765..ff5289228ea5 100644 > --- a/arch/x86/Kconfig > +++ b/arch/x86/Kconfig > @@ -150,6 +150,7 @@ config X86 > select HAVE_ARCH_COMPAT_MMAP_BASES if MMU && COMPAT > select HAVE_ARCH_PREL32_RELOCATIONS > select HAVE_ARCH_SECCOMP_FILTER > + select HAVE_ARCH_SECCOMP_CACHE_NR_ONLY > select HAVE_ARCH_THREAD_STRUCT_WHITELIST > select HAVE_ARCH_STACKLEAK > select HAVE_ARCH_TRACEHOOK > diff --git a/kernel/seccomp.c b/kernel/seccomp.c > index ae6b40cc39f4..f09c9e74ae05 100644 > --- a/kernel/seccomp.c > +++ b/kernel/seccomp.c > @@ -143,6 +143,37 @@ struct notification { > struct list_head notifications; > }; > > +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY > +/** > + * struct seccomp_cache_filter_data - container for cache's per-filter data naming nits: "data" doesn't tell me anything. "seccomp_action_cache" might be better. Or since it's an internal struct, maybe just "action_cache". And let's not use the word "container" for the kerndoc. ;) How about "per-filter cache of seccomp actions per arch/syscall pair" > + * > + * Tis struct is ordered to minimize padding holes. typo: This > + * > + * @syscall_allow_default: A bitmap where each bit represents whether the > + * filter willalways allow the syscall, for the typo: missing space > + * default architecture. default -> native > + * @syscall_allow_compat: A bitmap where each bit represents whether the > + * filter will always allow the syscall, for the > + * compat architecture. > + */ > +struct seccomp_cache_filter_data { > +#ifdef SECCOMP_ARCH_DEFAULT > + DECLARE_BITMAP(syscall_allow_default, SECCOMP_ARCH_DEFAULT_NR); naming nit: "syscall" is redundant here, IMO. "allow_native" should be fine. 
> +#endif > +#ifdef SECCOMP_ARCH_COMPAT > + DECLARE_BITMAP(syscall_allow_compat, SECCOMP_ARCH_COMPAT_NR); > +#endif > +}; > + > +#define SECCOMP_EMU_MAX_PENDING_STATES 64 > +#else > +struct seccomp_cache_filter_data { }; > + > +static inline void seccomp_cache_prepare(struct seccomp_filter *sfilter) > +{ > +} > +#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */ > + > /** > * struct seccomp_filter - container for seccomp BPF programs > * > @@ -159,6 +190,7 @@ struct notification { > * this filter after reaching 0. The @users count is always smaller > * or equal to @refs. Hence, reaching 0 for @users does not mean > * the filter can be freed. > + * @cache: container for cache-related data. more descriptive: "cache of arch/syscall mappings to actions" > * @log: true if all actions except for SECCOMP_RET_ALLOW should be logged > * @prev: points to a previously installed, or inherited, filter > * @prog: the BPF program to evaluate > @@ -180,6 +212,7 @@ struct seccomp_filter { > refcount_t refs; > refcount_t users; > bool log; > + struct seccomp_cache_filter_data cache; > struct seccomp_filter *prev; > struct bpf_prog *prog; > struct notification *notif; > @@ -544,7 +577,8 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog) > { > struct seccomp_filter *sfilter; > int ret; > - const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE); > + const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE) || > + IS_ENABLED(CONFIG_SECCOMP_CACHE_NR_ONLY); > > if (fprog->len == 0 || fprog->len > BPF_MAXINSNS) > return ERR_PTR(-EINVAL); > @@ -610,6 +644,136 @@ seccomp_prepare_user_filter(const char __user *user_filter) > return filter; > } > > +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY > +/** > + * seccomp_emu_is_const_allow - check if filter is constant allow with given data > + * @fprog: The BPF programs > + * @sd: The seccomp data to check against, only syscall number are arch > + * number are considered constant. 
> + */ > +static bool seccomp_emu_is_const_allow(struct sock_fprog_kern *fprog, > + struct seccomp_data *sd) naming: I would drop "emu" from here. The caller doesn't care how it is determined. ;) > +{ > + unsigned int insns; > + unsigned int reg_value = 0; > + unsigned int pc; > + bool op_res; > + > + if (WARN_ON_ONCE(!fprog)) > + return false; > + > + insns = bpf_classic_proglen(fprog); > + for (pc = 0; pc < insns; pc++) { > + struct sock_filter *insn = &fprog->filter[pc]; > + u16 code = insn->code; > + u32 k = insn->k; > + > + switch (code) { > + case BPF_LD | BPF_W | BPF_ABS: > + switch (k) { > + case offsetof(struct seccomp_data, nr): > + reg_value = sd->nr; > + break; > + case offsetof(struct seccomp_data, arch): > + reg_value = sd->arch; > + break; > + default: > + /* can't optimize (non-constant value load) */ > + return false; > + } > + break; > + case BPF_RET | BPF_K: > + /* reached return with constant values only, check allow */ > + return k == SECCOMP_RET_ALLOW; > + case BPF_JMP | BPF_JA: > + pc += insn->k; > + break; > + case BPF_JMP | BPF_JEQ | BPF_K: > + case BPF_JMP | BPF_JGE | BPF_K: > + case BPF_JMP | BPF_JGT | BPF_K: > + case BPF_JMP | BPF_JSET | BPF_K: > + switch (BPF_OP(code)) { > + case BPF_JEQ: > + op_res = reg_value == k; > + break; > + case BPF_JGE: > + op_res = reg_value >= k; > + break; > + case BPF_JGT: > + op_res = reg_value > k; > + break; > + case BPF_JSET: > + op_res = !!(reg_value & k); > + break; > + default: > + /* can't optimize (unknown jump) */ > + return false; > + } > + > + pc += op_res ? insn->jt : insn->jf; > + break; > + case BPF_ALU | BPF_AND | BPF_K: > + reg_value &= k; > + break; > + default: > + /* can't optimize (unknown insn) */ > + return false; > + } > + } > + > + /* ran off the end of the filter?! 
*/ > + WARN_ON(1); > + return false; > +} For the emulator patch, you'll want to include these tags in the commit log: Suggested-by: Jann Horn <jannh@google.com> Co-developed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Kees Cook <keescook@chromium.org> > + > +static void seccomp_cache_prepare_bitmap(struct seccomp_filter *sfilter, > + void *bitmap, const void *bitmap_prev, > + size_t bitmap_size, int arch) > +{ > + struct sock_fprog_kern *fprog = sfilter->prog->orig_prog; > + struct seccomp_data sd; > + int nr; > + > + for (nr = 0; nr < bitmap_size; nr++) { > + if (bitmap_prev && !test_bit(nr, bitmap_prev)) > + continue; > + > + sd.nr = nr; > + sd.arch = arch; > + > + if (seccomp_emu_is_const_allow(fprog, &sd)) > + set_bit(nr, bitmap); The guiding principle with seccomp's designs is to always make things _more_ restrictive, never less. While we can never escape the consequences of having seccomp_is_const_allow() report the wrong answer, we can at least follow the basic principles, hopefully minimizing the impact. When the bitmap starts with "always allowed" and we only flip it towards "run full filters", we're only ever making things more restrictive. If we instead go from "run full filters" towards "always allowed", we run the risk of making things less restrictive. For example: a process that maliciously adds a filter that the emulator mistakenly evaluates to "always allow" doesn't suddenly cause all the prior filters to stop running. (i.e. this isolates the flaw outcome, and doesn't depend on the early "do not emulate if we already know we have to run filters" case before the emulation call: there is no code path that allows the cache to weaken: it can only maintain it being wrong). Without any seccomp filter installed, all syscalls are "always allowed" (from the perspective of the seccomp boundary), so the default of the cache needs to be "always allowed". if (bitmap_prev) { /* The new filter must be as restrictive as the last. 
*/ bitmap_copy(bitmap, bitmap_prev, bitmap_size); } else { /* Before any filters, all syscalls are always allowed. */ bitmap_fill(bitmap, bitmap_size); } for (nr = 0; nr < bitmap_size; nr++) { /* No bitmap change: not a cacheable action. */ if (!test_bit(nr, bitmap)) continue; /* No bitmap change: continue to always allow. */ if (seccomp_is_const_allow(fprog, &sd)) continue; /* Not a cacheable action: always run filters. */ clear_bit(nr, bitmap); > + } > +} > + > +/** > + * seccomp_cache_prepare - emulate the filter to find cacheable syscalls > + * @sfilter: The seccomp filter > + * > + * The syscall bitmaps in @sfilter->cache are filled in place. > + */ > +static void seccomp_cache_prepare(struct seccomp_filter *sfilter) > +{ > + struct seccomp_cache_filter_data *cache = &sfilter->cache; > + const struct seccomp_cache_filter_data *cache_prev = > + sfilter->prev ? &sfilter->prev->cache : NULL; > + > +#ifdef SECCOMP_ARCH_DEFAULT > + seccomp_cache_prepare_bitmap(sfilter, cache->syscall_allow_default, > + cache_prev ? cache_prev->syscall_allow_default : NULL, > + SECCOMP_ARCH_DEFAULT_NR, > + SECCOMP_ARCH_DEFAULT); > +#endif /* SECCOMP_ARCH_DEFAULT */ > + > +#ifdef SECCOMP_ARCH_COMPAT > + seccomp_cache_prepare_bitmap(sfilter, cache->syscall_allow_compat, > + cache_prev ? cache_prev->syscall_allow_compat : NULL, > + SECCOMP_ARCH_COMPAT_NR, > + SECCOMP_ARCH_COMPAT); > +#endif /* SECCOMP_ARCH_COMPAT */ > +} > +#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */ > + > /** > * seccomp_attach_filter: validate and attach filter > * @flags: flags to change filter behavior > @@ -659,6 +823,7 @@ static long seccomp_attach_filter(unsigned int flags, > * task reference. > */ > filter->prev = current->seccomp.filter; > + seccomp_cache_prepare(filter); > current->seccomp.filter = filter; Jann, do we need a WRITE_ONCE() or something when writing current->seccomp.filter here?
I think the rmb() in __seccomp_filter() will cover the cache bitmap writes having finished before the filter pointer is followed in the TSYNC case. > atomic_inc(&current->seccomp.filter_count); > > -- > 2.28.0 > Otherwise, yes, I'm looking forward to having this for everyone to use! :) -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
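The exchange above centers on the emulator from patch 2/5: constant-propagate only loads of the syscall number and architecture, and bail out conservatively on anything else. That walk is small enough to exercise in user space. The following is a minimal sketch, not the kernel code: the opcode macros and struct layouts are redefined locally so it builds without kernel headers, and both demo filters are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Local copies of the classic-BPF opcode encoding (cf. linux/bpf_common.h). */
#define BPF_LD   0x00
#define BPF_ALU  0x04
#define BPF_JMP  0x05
#define BPF_RET  0x06
#define BPF_W    0x00
#define BPF_ABS  0x20
#define BPF_K    0x00
#define BPF_JA   0x00
#define BPF_JEQ  0x10
#define BPF_JGT  0x20
#define BPF_JGE  0x30
#define BPF_JSET 0x40
#define BPF_AND  0x50
#define BPF_OP(code) ((code) & 0xf0)

#define SECCOMP_RET_ALLOW 0x7fff0000U
#define SECCOMP_RET_KILL  0x00000000U

struct sock_filter { uint16_t code; uint8_t jt, jf; uint32_t k; };
struct seccomp_data {
	int32_t nr;
	uint32_t arch;
	uint64_t instruction_pointer;
	uint64_t args[6];
};

/* Same walk as the patch's emulator: true only when the filter provably
 * returns SECCOMP_RET_ALLOW using nothing but the nr/arch constants. */
static bool is_const_allow(const struct sock_filter *filter, size_t len,
			   const struct seccomp_data *sd)
{
	uint32_t reg_value = 0;
	size_t pc;

	for (pc = 0; pc < len; pc++) {
		uint16_t code = filter[pc].code;
		uint32_t k = filter[pc].k;
		bool op_res;

		switch (code) {
		case BPF_LD | BPF_W | BPF_ABS:
			if (k == offsetof(struct seccomp_data, nr))
				reg_value = (uint32_t)sd->nr;
			else if (k == offsetof(struct seccomp_data, arch))
				reg_value = sd->arch;
			else
				return false;	/* reads an argument or IP */
			break;
		case BPF_RET | BPF_K:
			/* reached a return with constant values only */
			return k == SECCOMP_RET_ALLOW;
		case BPF_JMP | BPF_JA:
			pc += k;
			break;
		case BPF_JMP | BPF_JEQ | BPF_K:
		case BPF_JMP | BPF_JGE | BPF_K:
		case BPF_JMP | BPF_JGT | BPF_K:
		case BPF_JMP | BPF_JSET | BPF_K:
			switch (BPF_OP(code)) {
			case BPF_JEQ: op_res = reg_value == k; break;
			case BPF_JGE: op_res = reg_value >= k; break;
			case BPF_JGT: op_res = reg_value > k;  break;
			default:      op_res = (reg_value & k) != 0; break;
			}
			pc += op_res ? filter[pc].jt : filter[pc].jf;
			break;
		case BPF_ALU | BPF_AND | BPF_K:
			reg_value &= k;
			break;
		default:
			return false;	/* unknown insn: stay conservative */
		}
	}
	return false;	/* ran off the end: never treat as cacheable */
}

/* Made-up demo: allow syscalls 0 and 1, kill the rest (arg-independent). */
static const struct sock_filter nr_only_demo[] = {
	{ BPF_LD | BPF_W | BPF_ABS, 0, 0, offsetof(struct seccomp_data, nr) },
	{ BPF_JMP | BPF_JGT | BPF_K, 1, 0, 1 },
	{ BPF_RET | BPF_K, 0, 0, SECCOMP_RET_ALLOW },
	{ BPF_RET | BPF_K, 0, 0, SECCOMP_RET_KILL },
};

/* Made-up demo: inspects args[0], so it can never be a constant allow. */
static const struct sock_filter arg_demo[] = {
	{ BPF_LD | BPF_W | BPF_ABS, 0, 0, offsetof(struct seccomp_data, args[0]) },
	{ BPF_RET | BPF_K, 0, 0, SECCOMP_RET_ALLOW },
};
```

Note how every unhandled opcode falls to `return false`: as with the kernel emulator, a wrong answer can only cause the full filters to keep running, never to be skipped.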
* Re: [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow 2020-09-30 22:40 ` Kees Cook @ 2020-10-01 11:52 ` YiFei Zhu 2020-10-01 21:05 ` Kees Cook 0 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-10-01 11:52 UTC (permalink / raw) To: Kees Cook Cc: Jann Horn, Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Wed, Sep 30, 2020 at 5:40 PM Kees Cook <keescook@chromium.org> wrote: > I don't want this config: there is only 1 caching mechanism happening > in this series and I do not want to have it buildable as "off": it > should be available for all supported architectures. When further caching > methods happen, the config can be introduced then (though I'll likely > argue it should then be a boot param to allow distro kernels to make it > selectable). Alright, we can think about configuration (or boot param) when more methods happen then. > The guiding principle with seccomp's designs is to always make things > _more_ restrictive, never less. While we can never escape the > consequences of having seccomp_is_const_allow() report the wrong > answer, we can at least follow the basic principles, hopefully > minimizing the impact. > > When the bitmap starts with "always allowed" and we only flip it towards > "run full filters", we're only ever making things more restrictive. If > we instead go from "run full filters" towards "always allowed", we run > the risk of making things less restrictive. For example: a process that > maliciously adds a filter that the emulator mistakenly evaluates to > "always allow" doesn't suddenly cause all the prior filters to stop running. > (i.e. 
this isolates the flaw outcome, and doesn't depend on the early > "do not emulate if we already know we have to run filters" case before > the emulation call: there is no code path that allows the cache to > weaken: it can only maintain it being wrong). > > Without any seccomp filter installed, all syscalls are "always allowed" > (from the perspective of the seccomp boundary), so the default of the > cache needs to be "always allowed". I cannot follow this. If a 'process that maliciously adds a filter that the emulator mistakenly evaluates to "always allow" doesn't suddenly cause all the prior filters to stop running', hence, you want, by default, the cache to be as transparent as possible. You would lift the restriction if and only if you are absolutely sure it does not cause an impact. In this patch, if there are prior filters, it goes through this logic: if (bitmap_prev && !test_bit(nr, bitmap_prev)) continue; Hence, if the malicious filter were to happen, and prior filters were supposed to run, then seccomp_is_const_allow is simply not invoked -- what it returns cannot be used maliciously by an adversary. > > if (bitmap_prev) { > /* The new filter must be as restrictive as the last. */ > bitmap_copy(bitmap, bitmap_prev, bitmap_size); > } else { > /* Before any filters, all syscalls are always allowed. */ > bitmap_fill(bitmap, bitmap_size); > } > > for (nr = 0; nr < bitmap_size; nr++) { > /* No bitmap change: not a cacheable action. */ > if (!test_bit(nr, bitmap)) > continue; > > /* No bitmap change: continue to always allow. */ > if (seccomp_is_const_allow(fprog, &sd)) > continue; > > /* Not a cacheable action: always run filters. */ > clear_bit(nr, bitmap); I'm not strongly against this logic. I just feel unconvinced that this is any different with a slightly increased complexity. YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
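The v3 shape YiFei defends here — skip the emulator entirely when the previous cache already forces the full filters — can be sketched with a toy boolean bitmap. Everything below is illustrative: the kernel uses real bitmaps with set_bit()/test_bit(), and the constant-allow predicate stands in for seccomp_is_const_allow().

```c
#include <assert.h>
#include <stdbool.h>

#define NR_SYSCALLS 64

/* Stand-in for seccomp_is_const_allow(): pretend the new filter is a
 * provable constant allow for syscalls below 32 only. */
static bool new_filter_const_allows(int nr)
{
	return nr < 32;
}

/* v3 shape: start all-clear; a bit is set only when the previous cache
 * allowed the syscall AND the new filter is a constant allow. When the
 * previous cache already forces the full filters, the emulator result is
 * never even consulted for that syscall. */
static void cache_prepare_v3(bool bitmap[NR_SYSCALLS], const bool *bitmap_prev)
{
	for (int nr = 0; nr < NR_SYSCALLS; nr++) {
		bitmap[nr] = false;
		if (bitmap_prev && !bitmap_prev[nr])
			continue;
		if (new_filter_const_allows(nr))
			bitmap[nr] = true;
	}
}
```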
* Re: [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow 2020-10-01 11:52 ` YiFei Zhu @ 2020-10-01 21:05 ` Kees Cook 2020-10-02 11:08 ` YiFei Zhu 0 siblings, 1 reply; 149+ messages in thread From: Kees Cook @ 2020-10-01 21:05 UTC (permalink / raw) To: YiFei Zhu Cc: Jann Horn, Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Thu, Oct 01, 2020 at 06:52:50AM -0500, YiFei Zhu wrote: > On Wed, Sep 30, 2020 at 5:40 PM Kees Cook <keescook@chromium.org> wrote: > > The guiding principle with seccomp's designs is to always make things > > _more_ restrictive, never less. While we can never escape the > > consequences of having seccomp_is_const_allow() report the wrong > > answer, we can at least follow the basic principles, hopefully > > minimizing the impact. > > > > When the bitmap starts with "always allowed" and we only flip it towards > > "run full filters", we're only ever making things more restrictive. If > > we instead go from "run full filters" towards "always allowed", we run > > the risk of making things less restrictive. For example: a process that > > maliciously adds a filter that the emulator mistakenly evaluates to > > "always allow" doesn't suddenly cause all the prior filters to stop running. > > (i.e. this isolates the flaw outcome, and doesn't depend on the early > > "do not emulate if we already know we have to run filters" case before > > the emulation call: there is no code path that allows the cache to > > weaken: it can only maintain it being wrong). > > > > Without any seccomp filter installed, all syscalls are "always allowed" > > (from the perspective of the seccomp boundary), so the default of the > > cache needs to be "always allowed". > > I cannot follow this. 
If a 'process that maliciously adds a filter > that the emulator mistakenly evaluates to "always allow" doesn't > suddenly cause all the prior filters to stop running', hence, you > want, by default, the cache to be as transparent as possible. You > would lift the restriction if and only if you are absolutely sure it > does not cause an impact. Yes, right now, the v3 code pattern is entirely safe. > > In this patch, if there are prior filters, it goes through this logic: > > if (bitmap_prev && !test_bit(nr, bitmap_prev)) > continue; > > Hence, if the malicious filter were to happen, and prior filters were > supposed to run, then seccomp_is_const_allow is simply not invoked -- > what it returns cannot be used maliciously by an adversary. Right, but we depend on that test always doing the correct thing (and continuing to do so into the future). I'm looking at this from the perspective of future changes, maintenance, etc. I want the actions to match the design principles as closely as possible so that future evolutions of the code have lower risk to bugs causing security failures. Right now, the code is simple. I want to design this so that when it is complex, it will still fail toward safety in the face of bugs. > > if (bitmap_prev) { > > /* The new filter must be as restrictive as the last. */ > > bitmap_copy(bitmap, bitmap_prev, bitmap_size); > > } else { > > /* Before any filters, all syscalls are always allowed. */ > > bitmap_fill(bitmap, bitmap_size); > > } > > > > for (nr = 0; nr < bitmap_size; nr++) { > > /* No bitmap change: not a cacheable action. */ > > if (!test_bit(nr, bitmap)) > > continue; > > > > /* No bitmap change: continue to always allow. */ > > if (seccomp_is_const_allow(fprog, &sd)) > > continue; > > > > /* Not a cacheable action: always run filters. */ > > clear_bit(nr, bitmap); > > I'm not strongly against this logic. I just feel unconvinced that this > is any different with a slightly increased complexity.
I'd prefer this way because then the loop, the tests, and the results only make the bitmap more restrictive. The worst thing a bug in here can do is leave the bitmap unchanged (which is certainly bad), but it can't _undo_ an earlier restriction. The proposed loop's leading test_bit() becomes only an optimization, rather than being required for policy enforcement. In other words, I prefer: inherit all prior bitmap restrictions for all syscalls if this filter not restricted continue set bitmap restricted within this loop (where the bulk of future logic may get added), the worst-case future bug-induced failure mode for the syscall bitmap is "skip *this* filter". Instead of: set bitmap all restricted for all syscalls if previous bitmap not restricted and filter not restricted set bitmap unrestricted within this loop the worst-case future bug-induced failure mode for the syscall bitmap is "skip *all* filters". Or, to reword again, this: retain restrictions from previous caching decisions for all syscalls [evaluate this filter, maybe continue] set restricted instead of: set new cache to all restricted for all syscalls [evaluate prior cache and this filter, maybe continue] set unrestricted I expect the future code changes for caching to be in the "evaluate" step, so I'd like the code designed to make things MORE restrictive not less from the start, and remove any prior cache state tests from the loop. At the end of the day I believe changing the design like this now lays the groundwork for the caching mechanism being more robust against having future bugs introduce security flaws. -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
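Kees's restriction-only shape can be sketched the same way as the v3 one: inherit the previous bitmap (or "all allowed" before any filter), then only ever clear bits. As before, the boolean bitmap and the constant-allow predicate are illustrative stand-ins, not kernel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NR_SYSCALLS 64

/* Stand-in for seccomp_is_const_allow(): pretend this filter is a
 * provable constant allow for even syscall numbers only. */
static bool this_filter_const_allows(int nr)
{
	return (nr % 2) == 0;
}

/* Restriction-only shape: a bug in the loop can at worst leave a bit set
 * that was already set -- it can never undo a restriction recorded by an
 * earlier filter, so failures fall toward "run the full filters". */
static void cache_prepare_restrict(bool bitmap[NR_SYSCALLS],
				   const bool *bitmap_prev)
{
	if (bitmap_prev) {
		/* The new filter must be as restrictive as the last. */
		memcpy(bitmap, bitmap_prev, NR_SYSCALLS * sizeof(bool));
	} else {
		/* Before any filters, all syscalls are always allowed. */
		for (int nr = 0; nr < NR_SYSCALLS; nr++)
			bitmap[nr] = true;
	}

	for (int nr = 0; nr < NR_SYSCALLS; nr++) {
		/* Already restricted: this test is only an optimization. */
		if (!bitmap[nr])
			continue;
		/* Still a constant allow under this filter: keep the bit. */
		if (this_filter_const_allows(nr))
			continue;
		/* Otherwise fail toward safety: always run full filters. */
		bitmap[nr] = false;
	}
}
```

When the emulator stand-in is bug-free, this computes the same bitmap as the v3 shape; the difference Kees argues for is only in which direction a future bug can push the result.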
* Re: [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow 2020-10-01 21:05 ` Kees Cook @ 2020-10-02 11:08 ` YiFei Zhu 0 siblings, 0 replies; 149+ messages in thread From: YiFei Zhu @ 2020-10-02 11:08 UTC (permalink / raw) To: Kees Cook Cc: Jann Horn, Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Thu, Oct 1, 2020 at 4:05 PM Kees Cook <keescook@chromium.org> wrote: > Right, but we depend on that test always doing the correct thing (and > continuing to do so into the future). I'm looking at this from the > perspective of future changes, maintenance, etc. I want the actions to > match the design principles as closely as possible so that future > evolutions of the code have lower risk to bugs causing security > failures. Right now, the code is simple. I want to design this so that > when it is complex, it will still fail toward safety in the face of > bugs. > > I'd prefer this way because for the loop, the tests, and the results only > make the bitmap more restrictive. The worst thing a bug in here can do is > leave the bitmap unchanged (which is certainly bad), but it can't _undo_ > an earlier restriction. > > The proposed loop's leading test_bit() becomes only an optimization, > rather than being required for policy enforcement. > > In other words, I prefer: > > inherit all prior prior bitmap restrictions > for all syscalls > if this filter not restricted > continue > set bitmap restricted > > within this loop (where the bulk of future logic may get added), > the worse-case future bug-induced failure mode for the syscall > bitmap is "skip *this* filter". 
> > > Instead of: > > set bitmap all restricted > for all syscalls > if previous bitmap not restricted and > filter not restricted > set bitmap unrestricted > > within this loop the worst-case future bug-induced failure mode > for the syscall bitmap is "skip *all* filters". > > > > > Or, to reword again, this: > > retain restrictions from previous caching decisions > for all syscalls > [evaluate this filter, maybe continue] > set restricted > > instead of: > > set new cache to all restricted > for all syscalls > [evaluate prior cache and this filter, maybe continue] > set unrestricted > > I expect the future code changes for caching to be in the "evaluate" > step, so I'd like the code designed to make things MORE restrictive not > less from the start, and remove any prior cache state tests from the > loop. > > At the end of the day I believe changing the design like this now lays > the groundwork to the caching mechanism being more robust against having > future bugs introduce security flaws. > I see. Makes sense. Thanks. Will do that in the v4 YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow 2020-09-30 15:19 ` [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow YiFei Zhu 2020-09-30 22:24 ` Jann Horn 2020-09-30 22:40 ` Kees Cook @ 2020-10-09 4:47 ` YiFei Zhu 2020-10-09 5:41 ` Kees Cook 2 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-10-09 4:47 UTC (permalink / raw) To: Linux Containers Cc: YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Wed, Sep 30, 2020 at 10:20 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > @@ -544,7 +577,8 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog) > { > struct seccomp_filter *sfilter; > int ret; > - const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE); > + const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE) || > + IS_ENABLED(CONFIG_SECCOMP_CACHE_NR_ONLY); > > if (fprog->len == 0 || fprog->len > BPF_MAXINSNS) > return ERR_PTR(-EINVAL); I'm trying to use __is_defined(SECCOMP_ARCH_NATIVE) here, and got this message: kernel/seccomp.c: In function ‘seccomp_prepare_filter’: ././include/linux/kconfig.h:44:44: error: pasting "__ARG_PLACEHOLDER_" and "(" does not give a valid preprocessing token 44 | #define ___is_defined(val) ____is_defined(__ARG_PLACEHOLDER_##val) | ^~~~~~~~~~~~~~~~~~ ././include/linux/kconfig.h:43:27: note: in expansion of macro ‘___is_defined’ 43 | #define __is_defined(x) ___is_defined(x) | ^~~~~~~~~~~~~ kernel/seccomp.c:629:11: note: in expansion of macro ‘__is_defined’ 629 | __is_defined(SECCOMP_ARCH_NATIVE); | ^~~~~~~~~~~~ Looking at the implementation of __is_defined, it is: #define __ARG_PLACEHOLDER_1 0, #define __take_second_arg(__ignored, val, ...) 
val #define __is_defined(x) ___is_defined(x) #define ___is_defined(val) ____is_defined(__ARG_PLACEHOLDER_##val) #define ____is_defined(arg1_or_junk) __take_second_arg(arg1_or_junk 1, 0) Hence, when FOO is defined to be 1, then the expansion would be __is_defined(FOO) -> ___is_defined(1) -> ____is_defined(__ARG_PLACEHOLDER_1) -> __take_second_arg(0, 1, 0) -> 1, and when FOO is not defined, the expansion would be __is_defined(FOO) -> ___is_defined(FOO) -> ____is_defined(__ARG_PLACEHOLDER_FOO) -> __take_second_arg(__ARG_PLACEHOLDER_FOO 1, 0) -> 0 However, here SECCOMP_ARCH_NATIVE is an expression from an OR of some bits, and __is_defined(SECCOMP_ARCH_NATIVE) would not expand to __ARG_PLACEHOLDER_1 during any stage in the preprocessing. Is there any better way to do this? I'm thinking of just doing #if defined(CONFIG_CHECKPOINT_RESTORE) || defined(SECCOMP_ARCH_NATIVE) like in Kees's patch. YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
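The expansion chain YiFei traces can be reproduced outside the kernel by copying the kconfig.h helper shapes. The working cases are checkable at run time; the failing case is a preprocessor error by construction, so it can only be noted in a comment.

```c
#include <assert.h>

/* Copied shape of the helpers from include/linux/kconfig.h. */
#define __ARG_PLACEHOLDER_1 0,
#define __take_second_arg(__ignored, val, ...) val
#define __is_defined(x) ___is_defined(x)
#define ___is_defined(val) ____is_defined(__ARG_PLACEHOLDER_##val)
#define ____is_defined(arg1_or_junk) __take_second_arg(arg1_or_junk 1, 0)

#define FOO 1	/* defined to the single token 1 */
/* BAR deliberately left undefined */

static int foo_is_defined(void) { return __is_defined(FOO); }
static int bar_is_defined(void) { return __is_defined(BAR); }

/* A macro defined to an expression, like SECCOMP_ARCH_NATIVE's
 * "(AUDIT_ARCH_X86_64 | ...)", cannot go through __is_defined():
 * pasting __ARG_PLACEHOLDER_ onto "(" is exactly the
 * "does not give a valid preprocessing token" error quoted above,
 * so the equivalent here must stay commented out:
 *
 *   #define BAZ (0x40000000 | 0x3e)
 *   ... __is_defined(BAZ) would not compile ...
 */
```

The double indirection matters: in `__is_defined(x)`, `x` is fully macro-expanded before reaching the `##` paste, which is why a macro defined to `1` turns into `__ARG_PLACEHOLDER_1` while an undefined name survives as junk tokens that select the trailing `0`.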
* Re: [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow 2020-10-09 4:47 ` YiFei Zhu @ 2020-10-09 5:41 ` Kees Cook 0 siblings, 0 replies; 149+ messages in thread From: Kees Cook @ 2020-10-09 5:41 UTC (permalink / raw) To: YiFei Zhu Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Thu, Oct 08, 2020 at 11:47:17PM -0500, YiFei Zhu wrote: > On Wed, Sep 30, 2020 at 10:20 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > > @@ -544,7 +577,8 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog) > > { > > struct seccomp_filter *sfilter; > > int ret; > > - const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE); > > + const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE) || > > + IS_ENABLED(CONFIG_SECCOMP_CACHE_NR_ONLY); > > > > if (fprog->len == 0 || fprog->len > BPF_MAXINSNS) > > return ERR_PTR(-EINVAL); > > I'm trying to use __is_defined(SECCOMP_ARCH_NATIVE) here, and got this message: > > kernel/seccomp.c: In function ‘seccomp_prepare_filter’: > ././include/linux/kconfig.h:44:44: error: pasting "__ARG_PLACEHOLDER_" > and "(" does not give a valid preprocessing token > 44 | #define ___is_defined(val) ____is_defined(__ARG_PLACEHOLDER_##val) > | ^~~~~~~~~~~~~~~~~~ > ././include/linux/kconfig.h:43:27: note: in expansion of macro ‘___is_defined’ > 43 | #define __is_defined(x) ___is_defined(x) > | ^~~~~~~~~~~~~ > kernel/seccomp.c:629:11: note: in expansion of macro ‘__is_defined’ > 629 | __is_defined(SECCOMP_ARCH_NATIVE); > | ^~~~~~~~~~~~ > > Looking at the implementation of __is_defined, it is: > > #define __ARG_PLACEHOLDER_1 0, > #define __take_second_arg(__ignored, val, ...) 
val > #define __is_defined(x) ___is_defined(x) > #define ___is_defined(val) ____is_defined(__ARG_PLACEHOLDER_##val) > #define ____is_defined(arg1_or_junk) __take_second_arg(arg1_or_junk 1, 0) > > Hence, when FOO is defined to be 1, then the expansion would be > __is_defined(FOO) -> ___is_defined(1) -> > ____is_defined(__ARG_PLACEHOLDER_1) -> __take_second_arg(0, 1, 0) -> > 1, > and when FOO is not defined, the expansion would be __is_defined(FOO) > -> ___is_defined(FOO) -> ____is_defined(__ARG_PLACEHOLDER_FOO) -> > __take_second_arg(__ARG_PLACEHOLDER_FOO 1, 0) -> 0 > > However, here SECCOMP_ARCH_NATIVE is an expression from an OR of some > bits, and __is_defined(SECCOMP_ARCH_NATIVE) would not expand to > __ARG_PLACEHOLDER_1 during any stage in the preprocessing. > > Is there any better way to do this? I'm thinking of just doing #if > defined(CONFIG_CHECKPOINT_RESTORE) || defined(SECCOMP_ARCH_NATIVE) > like in Kee's patch. Yeah, I think that's simplest. -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
* [PATCH v3 seccomp 3/5] seccomp/cache: Lookup syscall allowlist for fast path 2020-09-30 15:19 ` [PATCH v3 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu 2020-09-30 15:19 ` [PATCH v3 seccomp 1/5] x86: Enable seccomp architecture tracking YiFei Zhu 2020-09-30 15:19 ` [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow YiFei Zhu @ 2020-09-30 15:19 ` YiFei Zhu 2020-09-30 21:32 ` Kees Cook 2020-09-30 15:19 ` [PATCH v3 seccomp 4/5] selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu ` (2 subsequent siblings) 5 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-09-30 15:19 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: YiFei Zhu <yifeifz2@illinois.edu> The fast (common) path for seccomp should be that the filter permits the syscall to pass through, and failing seccomp is expected to be an exceptional case; it is not expected for userspace to call a denylisted syscall over and over. This first finds the current allow bitmask by iterating through syscall_arches[] array and comparing it to the one in struct seccomp_data; this loop is expected to be unrolled. It then does a test_bit against the bitmask. If the bit is set, then there is no need to run the full filter; it returns SECCOMP_RET_ALLOW immediately. 
Co-developed-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu> Signed-off-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu> Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> --- kernel/seccomp.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 52 insertions(+) diff --git a/kernel/seccomp.c b/kernel/seccomp.c index f09c9e74ae05..bed3b2a7f6c8 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -172,6 +172,12 @@ struct seccomp_cache_filter_data { }; static inline void seccomp_cache_prepare(struct seccomp_filter *sfilter) { } + +static inline bool seccomp_cache_check(const struct seccomp_filter *sfilter, + const struct seccomp_data *sd) +{ + return false; +} #endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */ /** @@ -331,6 +337,49 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen) return 0; } +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY +static bool seccomp_cache_check_bitmap(const void *bitmap, size_t bitmap_size, + int syscall_nr) +{ + if (unlikely(syscall_nr < 0 || syscall_nr >= bitmap_size)) + return false; + syscall_nr = array_index_nospec(syscall_nr, bitmap_size); + + return test_bit(syscall_nr, bitmap); +} + +/** + * seccomp_cache_check - lookup seccomp cache + * @sfilter: The seccomp filter + * @sd: The seccomp data to lookup the cache with + * + * Returns true if the seccomp_data is cached and allowed. 
+ */ +static bool seccomp_cache_check(const struct seccomp_filter *sfilter, + const struct seccomp_data *sd) +{ + int syscall_nr = sd->nr; + const struct seccomp_cache_filter_data *cache = &sfilter->cache; + +#ifdef SECCOMP_ARCH_DEFAULT + if (likely(sd->arch == SECCOMP_ARCH_DEFAULT)) + return seccomp_cache_check_bitmap(cache->syscall_allow_default, + SECCOMP_ARCH_DEFAULT_NR, + syscall_nr); +#endif /* SECCOMP_ARCH_DEFAULT */ + +#ifdef SECCOMP_ARCH_COMPAT + if (likely(sd->arch == SECCOMP_ARCH_COMPAT)) + return seccomp_cache_check_bitmap(cache->syscall_allow_compat, + SECCOMP_ARCH_COMPAT_NR, + syscall_nr); +#endif /* SECCOMP_ARCH_COMPAT */ + + WARN_ON_ONCE(true); + return false; +} +#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */ + /** * seccomp_run_filters - evaluates all seccomp filters against @sd * @sd: optional seccomp data to be passed to filters @@ -353,6 +402,9 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd, if (WARN_ON(f == NULL)) return SECCOMP_RET_KILL_PROCESS; + if (seccomp_cache_check(f, sd)) + return SECCOMP_RET_ALLOW; + /* * All filters in the list are evaluated and the lowest BPF return * value always takes priority (ignoring the DATA). -- 2.28.0 ^ permalink raw reply related [flat|nested] 149+ messages in thread
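The fast path this patch adds is just a bounds check followed by a bit test. A user-space approximation looks like the following; helper names are invented, and the sketch omits the array_index_nospec() Spectre clamp, which has no portable user-space equivalent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define BITS_PER_LONG (8 * sizeof(unsigned long))
#define BITMAP_WORDS(bits) (((bits) + BITS_PER_LONG - 1) / BITS_PER_LONG)

/* User-space test_bit(); the kernel version is the same idea. */
static bool test_bit_ul(size_t nr, const unsigned long *map)
{
	return (map[nr / BITS_PER_LONG] >> (nr % BITS_PER_LONG)) & 1UL;
}

/* Mirror of seccomp_cache_check_bitmap(): out-of-range syscall numbers
 * fall back to "not cached", i.e. the full filters still run. The kernel
 * additionally clamps the index with array_index_nospec() before the
 * bitmap access to keep speculation in bounds. */
static bool cache_check_bitmap(const unsigned long *bitmap,
			       size_t bitmap_size, int syscall_nr)
{
	if (syscall_nr < 0 || (size_t)syscall_nr >= bitmap_size)
		return false;
	return test_bit_ul((size_t)syscall_nr, bitmap);
}
```

As in seccomp_run_filters(), a true result would short-circuit to SECCOMP_RET_ALLOW, and a false result simply falls through to the normal filter walk.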
* Re: [PATCH v3 seccomp 3/5] seccomp/cache: Lookup syscall allowlist for fast path 2020-09-30 15:19 ` [PATCH v3 seccomp 3/5] seccomp/cache: Lookup syscall allowlist for fast path YiFei Zhu @ 2020-09-30 21:32 ` Kees Cook 2020-10-09 0:17 ` YiFei Zhu 0 siblings, 1 reply; 149+ messages in thread From: Kees Cook @ 2020-09-30 21:32 UTC (permalink / raw) To: YiFei Zhu Cc: containers, YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Wed, Sep 30, 2020 at 10:19:14AM -0500, YiFei Zhu wrote: > From: YiFei Zhu <yifeifz2@illinois.edu> > > The fast (common) path for seccomp should be that the filter permits > the syscall to pass through, and failing seccomp is expected to be > an exceptional case; it is not expected for userspace to call a > denylisted syscall over and over. > > This first finds the current allow bitmask by iterating through > syscall_arches[] array and comparing it to the one in struct > seccomp_data; this loop is expected to be unrolled. It then > does a test_bit against the bitmask. If the bit is set, then > there is no need to run the full filter; it returns > SECCOMP_RET_ALLOW immediately. > > Co-developed-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu> > Signed-off-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu> > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> I'd like the content/ordering of this and the emulator patch to be reorganized a bit. I'd like to see the infrastructure of the cache added first (along with the "always allow" test logic in this patch), with the emulator missing: i.e. the patch is a logical no-op: no behavior changes because nothing ever changes the cache bits, but all the operational logic, structure changes, etc, is in place. Then the next patch would be replacing the no-op with the emulator. 
> --- > kernel/seccomp.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 52 insertions(+) > > diff --git a/kernel/seccomp.c b/kernel/seccomp.c > index f09c9e74ae05..bed3b2a7f6c8 100644 > --- a/kernel/seccomp.c > +++ b/kernel/seccomp.c > @@ -172,6 +172,12 @@ struct seccomp_cache_filter_data { }; > static inline void seccomp_cache_prepare(struct seccomp_filter *sfilter) > { > } > + > +static inline bool seccomp_cache_check(const struct seccomp_filter *sfilter, bikeshedding: "cache check" doesn't tell me anything about what it's actually checking for. How about calling this seccomp_is_constant_allow() or something that reflects both the "bool" return ("is") and what that bool means ("should always be allowed"). > + const struct seccomp_data *sd) > +{ > + return false; > +} > #endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */ > > /** > @@ -331,6 +337,49 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen) > return 0; > } > > +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY > +static bool seccomp_cache_check_bitmap(const void *bitmap, size_t bitmap_size, Please also mark as "inline". > + int syscall_nr) > +{ > + if (unlikely(syscall_nr < 0 || syscall_nr >= bitmap_size)) > + return false; > + syscall_nr = array_index_nospec(syscall_nr, bitmap_size); > + > + return test_bit(syscall_nr, bitmap); > +} > + > +/** > + * seccomp_cache_check - lookup seccomp cache > + * @sfilter: The seccomp filter > + * @sd: The seccomp data to lookup the cache with > + * > + * Returns true if the seccomp_data is cached and allowed. > + */ > +static bool seccomp_cache_check(const struct seccomp_filter *sfilter, inline too. 
> + const struct seccomp_data *sd) > +{ > + int syscall_nr = sd->nr; > + const struct seccomp_cache_filter_data *cache = &sfilter->cache; > + > +#ifdef SECCOMP_ARCH_DEFAULT > + if (likely(sd->arch == SECCOMP_ARCH_DEFAULT)) > + return seccomp_cache_check_bitmap(cache->syscall_allow_default, > + SECCOMP_ARCH_DEFAULT_NR, > + syscall_nr); > +#endif /* SECCOMP_ARCH_DEFAULT */ > + > +#ifdef SECCOMP_ARCH_COMPAT > + if (likely(sd->arch == SECCOMP_ARCH_COMPAT)) > + return seccomp_cache_check_bitmap(cache->syscall_allow_compat, > + SECCOMP_ARCH_COMPAT_NR, > + syscall_nr); > +#endif /* SECCOMP_ARCH_COMPAT */ > + > + WARN_ON_ONCE(true); > + return false; > +} > +#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */ > + > /** > * seccomp_run_filters - evaluates all seccomp filters against @sd > * @sd: optional seccomp data to be passed to filters > @@ -353,6 +402,9 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd, > if (WARN_ON(f == NULL)) > return SECCOMP_RET_KILL_PROCESS; > > + if (seccomp_cache_check(f, sd)) > + return SECCOMP_RET_ALLOW; > + > /* > * All filters in the list are evaluated and the lowest BPF return > * value always takes priority (ignoring the DATA). > -- > 2.28.0 > Otherwise, yup, looks good. -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v3 seccomp 3/5] seccomp/cache: Lookup syscall allowlist for fast path 2020-09-30 21:32 ` Kees Cook @ 2020-10-09 0:17 ` YiFei Zhu 2020-10-09 5:35 ` Kees Cook 0 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-10-09 0:17 UTC (permalink / raw) To: Kees Cook Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Wed, Sep 30, 2020 at 4:32 PM Kees Cook <keescook@chromium.org> wrote: > > On Wed, Sep 30, 2020 at 10:19:14AM -0500, YiFei Zhu wrote: > > From: YiFei Zhu <yifeifz2@illinois.edu> > > > > The fast (common) path for seccomp should be that the filter permits > > the syscall to pass through, and failing seccomp is expected to be > > an exceptional case; it is not expected for userspace to call a > > denylisted syscall over and over. > > > > This first finds the current allow bitmask by iterating through > > syscall_arches[] array and comparing it to the one in struct > > seccomp_data; this loop is expected to be unrolled. It then > > does a test_bit against the bitmask. If the bit is set, then > > there is no need to run the full filter; it returns > > SECCOMP_RET_ALLOW immediately. > > > > Co-developed-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu> > > Signed-off-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu> > > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> > > I'd like the content/ordering of this and the emulator patch to be reorganized a bit. > I'd like to see the infrastructure of the cache added first (along with > the "always allow" test logic in this patch), with the emulator missing: > i.e. the patch is a logical no-op: no behavior changes because nothing > ever changes the cache bits, but all the operational logic, structure > changes, etc, is in place. 
Then the next patch would be replacing the > no-op with the emulator. > > > --- > > kernel/seccomp.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++++ > > 1 file changed, 52 insertions(+) > > > > diff --git a/kernel/seccomp.c b/kernel/seccomp.c > > index f09c9e74ae05..bed3b2a7f6c8 100644 > > --- a/kernel/seccomp.c > > +++ b/kernel/seccomp.c > > @@ -172,6 +172,12 @@ struct seccomp_cache_filter_data { }; > > static inline void seccomp_cache_prepare(struct seccomp_filter *sfilter) > > { > > } > > + > > +static inline bool seccomp_cache_check(const struct seccomp_filter *sfilter, > > bikeshedding: "cache check" doesn't tell me anything about what it's > actually checking for. How about calling this seccomp_is_constant_allow() or > something that reflects both the "bool" return ("is") and what that bool > means ("should always be allowed"). We have a naming conflict here. I'm about to rename seccomp_emu_is_const_allow to seccomp_is_const_allow. Adding another seccomp_is_constant_allow is confusing. Suggestions? I think I would prefer to change seccomp_cache_check to seccomp_cache_check_allow. While in this patch set seccomp_cache_check does imply the filter is "constant" allow, argument-processing cache may change this, and specifying an "allow" in the name specifies the 'what that bool means ("should always be allowed")'. YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v3 seccomp 3/5] seccomp/cache: Lookup syscall allowlist for fast path 2020-10-09 0:17 ` YiFei Zhu @ 2020-10-09 5:35 ` Kees Cook 0 siblings, 0 replies; 149+ messages in thread From: Kees Cook @ 2020-10-09 5:35 UTC (permalink / raw) To: YiFei Zhu Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Thu, Oct 08, 2020 at 07:17:39PM -0500, YiFei Zhu wrote: > On Wed, Sep 30, 2020 at 4:32 PM Kees Cook <keescook@chromium.org> wrote: > > > > On Wed, Sep 30, 2020 at 10:19:14AM -0500, YiFei Zhu wrote: > > > From: YiFei Zhu <yifeifz2@illinois.edu> > > > > > > The fast (common) path for seccomp should be that the filter permits > > > the syscall to pass through, and failing seccomp is expected to be > > > an exceptional case; it is not expected for userspace to call a > > > denylisted syscall over and over. > > > > > > This first finds the current allow bitmask by iterating through > > > syscall_arches[] array and comparing it to the one in struct > > > seccomp_data; this loop is expected to be unrolled. It then > > > does a test_bit against the bitmask. If the bit is set, then > > > there is no need to run the full filter; it returns > > > SECCOMP_RET_ALLOW immediately. > > > > > > Co-developed-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu> > > > Signed-off-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu> > > > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> > > > > I'd like the content/ordering of this and the emulator patch to be reorganized a bit. > > I'd like to see the infrastructure of the cache added first (along with > > the "always allow" test logic in this patch), with the emulator missing: > > i.e. 
the patch is a logical no-op: no behavior changes because nothing > > ever changes the cache bits, but all the operational logic, structure > > changes, etc, is in place. Then the next patch would be replacing the > > no-op with the emulator. > > > > > --- > > > kernel/seccomp.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++++ > > > 1 file changed, 52 insertions(+) > > > > > > diff --git a/kernel/seccomp.c b/kernel/seccomp.c > > > index f09c9e74ae05..bed3b2a7f6c8 100644 > > > --- a/kernel/seccomp.c > > > +++ b/kernel/seccomp.c > > > @@ -172,6 +172,12 @@ struct seccomp_cache_filter_data { }; > > > static inline void seccomp_cache_prepare(struct seccomp_filter *sfilter) > > > { > > > } > > > + > > > +static inline bool seccomp_cache_check(const struct seccomp_filter *sfilter, > > > > bikeshedding: "cache check" doesn't tell me anything about what it's > > actually checking for. How about calling this seccomp_is_constant_allow() or > > something that reflects both the "bool" return ("is") and what that bool > > means ("should always be allowed"). > > We have a naming conflict here. I'm about to rename > seccomp_emu_is_const_allow to seccomp_is_const_allow. Adding another > seccomp_is_constant_allow is confusing. Suggestions? > > I think I would prefer to change seccomp_cache_check to > seccomp_cache_check_allow. While in this patch set seccomp_cache_check > does imply the filter is "constant" allow, argument-processing cache > may change this, and specifying an "allow" in the name specifies the > 'what that bool means ("should always be allowed")'. Yeah, that seems good. -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
* [PATCH v3 seccomp 4/5] selftests/seccomp: Compare bitmap vs filter overhead 2020-09-30 15:19 ` [PATCH v3 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu ` (2 preceding siblings ...) 2020-09-30 15:19 ` [PATCH v3 seccomp 3/5] seccomp/cache: Lookup syscall allowlist for fast path YiFei Zhu @ 2020-09-30 15:19 ` YiFei Zhu 2020-09-30 15:19 ` [PATCH v3 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu 2020-10-09 17:14 ` [PATCH v4 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu 5 siblings, 0 replies; 149+ messages in thread From: YiFei Zhu @ 2020-09-30 15:19 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: Kees Cook <keescook@chromium.org> As part of the seccomp benchmarking, include the expectations with regard to the timing behavior of the constant action bitmaps, and report inconsistencies better. Example output with constant action bitmaps on x86: $ sudo ./seccomp_benchmark 100000000 Current BPF sysctl settings: net.core.bpf_jit_enable = 1 net.core.bpf_jit_harden = 0 Benchmarking 200000000 syscalls... 
129.359381409 - 0.008724424 = 129350656985 (129.4s) getpid native: 646 ns 264.385890006 - 129.360453229 = 135025436777 (135.0s) getpid RET_ALLOW 1 filter (bitmap): 675 ns 399.400511893 - 264.387045901 = 135013465992 (135.0s) getpid RET_ALLOW 2 filters (bitmap): 675 ns 545.872866260 - 399.401718327 = 146471147933 (146.5s) getpid RET_ALLOW 3 filters (full): 732 ns 696.337101319 - 545.874097681 = 150463003638 (150.5s) getpid RET_ALLOW 4 filters (full): 752 ns Estimated total seccomp overhead for 1 bitmapped filter: 29 ns Estimated total seccomp overhead for 2 bitmapped filters: 29 ns Estimated total seccomp overhead for 3 full filters: 86 ns Estimated total seccomp overhead for 4 full filters: 106 ns Estimated seccomp entry overhead: 29 ns Estimated seccomp per-filter overhead (last 2 diff): 20 ns Estimated seccomp per-filter overhead (filters / 4): 19 ns Expectations: native ≤ 1 bitmap (646 ≤ 675): ✔️ native ≤ 1 filter (646 ≤ 732): ✔️ per-filter (last 2 diff) ≈ per-filter (filters / 4) (20 ≈ 19): ✔️ 1 bitmapped ≈ 2 bitmapped (29 ≈ 29): ✔️ entry ≈ 1 bitmapped (29 ≈ 29): ✔️ entry ≈ 2 bitmapped (29 ≈ 29): ✔️ native + entry + (per filter * 4) ≈ 4 filters total (755 ≈ 752): ✔️ Signed-off-by: Kees Cook <keescook@chromium.org> [YiFei: Changed commit message to show stats for this patch series] Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> --- .../selftests/seccomp/seccomp_benchmark.c | 151 +++++++++++++++--- tools/testing/selftests/seccomp/settings | 2 +- 2 files changed, 130 insertions(+), 23 deletions(-) diff --git a/tools/testing/selftests/seccomp/seccomp_benchmark.c b/tools/testing/selftests/seccomp/seccomp_benchmark.c index 91f5a89cadac..fcc806585266 100644 --- a/tools/testing/selftests/seccomp/seccomp_benchmark.c +++ b/tools/testing/selftests/seccomp/seccomp_benchmark.c @@ -4,12 +4,16 @@ */ #define _GNU_SOURCE #include <assert.h> +#include <limits.h> +#include <stdbool.h> +#include <stddef.h> #include <stdio.h> #include <stdlib.h> #include <time.h> #include 
<unistd.h> #include <linux/filter.h> #include <linux/seccomp.h> +#include <sys/param.h> #include <sys/prctl.h> #include <sys/syscall.h> #include <sys/types.h> @@ -70,18 +74,74 @@ unsigned long long calibrate(void) return samples * seconds; } +bool approx(int i_one, int i_two) +{ + double one = i_one, one_bump = one * 0.01; + double two = i_two, two_bump = two * 0.01; + + one_bump = one + MAX(one_bump, 2.0); + two_bump = two + MAX(two_bump, 2.0); + + /* Equal to, or within 1% or 2 digits */ + if (one == two || + (one > two && one <= two_bump) || + (two > one && two <= one_bump)) + return true; + return false; +} + +bool le(int i_one, int i_two) +{ + if (i_one <= i_two) + return true; + return false; +} + +long compare(const char *name_one, const char *name_eval, const char *name_two, + unsigned long long one, bool (*eval)(int, int), unsigned long long two) +{ + bool good; + + printf("\t%s %s %s (%lld %s %lld): ", name_one, name_eval, name_two, + (long long)one, name_eval, (long long)two); + if (one > INT_MAX) { + printf("Miscalculation! Measurement went negative: %lld\n", (long long)one); + return 1; + } + if (two > INT_MAX) { + printf("Miscalculation! Measurement went negative: %lld\n", (long long)two); + return 1; + } + + good = eval(one, two); + printf("%s\n", good ? "✔️" : "❌"); + + return good ? 
0 : 1; +} + int main(int argc, char *argv[]) { + struct sock_filter bitmap_filter[] = { + BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, nr)), + BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW), + }; + struct sock_fprog bitmap_prog = { + .len = (unsigned short)ARRAY_SIZE(bitmap_filter), + .filter = bitmap_filter, + }; struct sock_filter filter[] = { + BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, args[0])), BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW), }; struct sock_fprog prog = { .len = (unsigned short)ARRAY_SIZE(filter), .filter = filter, }; - long ret; - unsigned long long samples; - unsigned long long native, filter1, filter2; + + long ret, bits; + unsigned long long samples, calc; + unsigned long long native, filter1, filter2, bitmap1, bitmap2; + unsigned long long entry, per_filter1, per_filter2; printf("Current BPF sysctl settings:\n"); system("sysctl net.core.bpf_jit_enable"); @@ -101,35 +161,82 @@ int main(int argc, char *argv[]) ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0); assert(ret == 0); - /* One filter */ - ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog); + /* One filter resulting in a bitmap */ + ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bitmap_prog); assert(ret == 0); - filter1 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples; - printf("getpid RET_ALLOW 1 filter: %llu ns\n", filter1); + bitmap1 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples; + printf("getpid RET_ALLOW 1 filter (bitmap): %llu ns\n", bitmap1); + + /* Second filter resulting in a bitmap */ + ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bitmap_prog); + assert(ret == 0); - if (filter1 == native) - printf("No overhead measured!? 
Try running again with more samples.\n"); + bitmap2 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples; + printf("getpid RET_ALLOW 2 filters (bitmap): %llu ns\n", bitmap2); - /* Two filters */ + /* Third filter, can no longer be converted to bitmap */ ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog); assert(ret == 0); - filter2 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples; - printf("getpid RET_ALLOW 2 filters: %llu ns\n", filter2); - - /* Calculations */ - printf("Estimated total seccomp overhead for 1 filter: %llu ns\n", - filter1 - native); + filter1 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples; + printf("getpid RET_ALLOW 3 filters (full): %llu ns\n", filter1); - printf("Estimated total seccomp overhead for 2 filters: %llu ns\n", - filter2 - native); + /* Fourth filter, can not be converted to bitmap because of filter 3 */ + ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bitmap_prog); + assert(ret == 0); - printf("Estimated seccomp per-filter overhead: %llu ns\n", - filter2 - filter1); + filter2 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples; + printf("getpid RET_ALLOW 4 filters (full): %llu ns\n", filter2); + + /* Estimations */ +#define ESTIMATE(fmt, var, what) do { \ + var = (what); \ + printf("Estimated " fmt ": %llu ns\n", var); \ + if (var > INT_MAX) \ + goto more_samples; \ + } while (0) + + ESTIMATE("total seccomp overhead for 1 bitmapped filter", calc, + bitmap1 - native); + ESTIMATE("total seccomp overhead for 2 bitmapped filters", calc, + bitmap2 - native); + ESTIMATE("total seccomp overhead for 3 full filters", calc, + filter1 - native); + ESTIMATE("total seccomp overhead for 4 full filters", calc, + filter2 - native); + ESTIMATE("seccomp entry overhead", entry, + bitmap1 - native - (bitmap2 - bitmap1)); + ESTIMATE("seccomp per-filter overhead (last 2 diff)", per_filter1, + filter2 - filter1); + ESTIMATE("seccomp per-filter overhead (filters / 4)", per_filter2, + (filter2 - native - entry) / 4); + + 
printf("Expectations:\n"); + ret |= compare("native", "≤", "1 bitmap", native, le, bitmap1); + bits = compare("native", "≤", "1 filter", native, le, filter1); + if (bits) + goto more_samples; + + ret |= compare("per-filter (last 2 diff)", "≈", "per-filter (filters / 4)", + per_filter1, approx, per_filter2); + + bits = compare("1 bitmapped", "≈", "2 bitmapped", + bitmap1 - native, approx, bitmap2 - native); + if (bits) { + printf("Skipping constant action bitmap expectations: they appear unsupported.\n"); + goto out; + } - printf("Estimated seccomp entry overhead: %llu ns\n", - filter1 - native - (filter2 - filter1)); + ret |= compare("entry", "≈", "1 bitmapped", entry, approx, bitmap1 - native); + ret |= compare("entry", "≈", "2 bitmapped", entry, approx, bitmap2 - native); + ret |= compare("native + entry + (per filter * 4)", "≈", "4 filters total", + entry + (per_filter1 * 4) + native, approx, filter2); + if (ret == 0) + goto out; +more_samples: + printf("Saw unexpected benchmark result. Try running again with more samples?\n"); +out: return 0; } diff --git a/tools/testing/selftests/seccomp/settings b/tools/testing/selftests/seccomp/settings index ba4d85f74cd6..6091b45d226b 100644 --- a/tools/testing/selftests/seccomp/settings +++ b/tools/testing/selftests/seccomp/settings @@ -1 +1 @@ -timeout=90 +timeout=120 -- 2.28.0 ^ permalink raw reply related [flat|nested] 149+ messages in thread
* [PATCH v3 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-09-30 15:19 ` [PATCH v3 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu ` (3 preceding siblings ...) 2020-09-30 15:19 ` [PATCH v3 seccomp 4/5] selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu @ 2020-09-30 15:19 ` YiFei Zhu 2020-09-30 22:00 ` Jann Horn 2020-09-30 22:59 ` Kees Cook 2020-10-09 17:14 ` [PATCH v4 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu 5 siblings, 2 replies; 149+ messages in thread From: YiFei Zhu @ 2020-09-30 15:19 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: YiFei Zhu <yifeifz2@illinois.edu> Currently the kernel does not provide an infrastructure to translate architecture numbers to a human-readable name. Translating syscall numbers to syscall names is possible through FTRACE_SYSCALL infrastructure but it does not provide support for compat syscalls. This will create a file for each PID as /proc/pid/seccomp_cache. The file will be empty when no seccomp filters are loaded, or be in the format of: <arch name> <decimal syscall number> <ALLOW | FILTER> where ALLOW means the cache is guaranteed to allow the syscall, and filter means the cache will pass the syscall to the BPF filter. For the docker default profile on x86_64 it looks like: x86_64 0 ALLOW x86_64 1 ALLOW x86_64 2 ALLOW x86_64 3 ALLOW [...] x86_64 132 ALLOW x86_64 133 ALLOW x86_64 134 FILTER x86_64 135 FILTER x86_64 136 FILTER x86_64 137 ALLOW x86_64 138 ALLOW x86_64 139 FILTER x86_64 140 ALLOW x86_64 141 ALLOW [...] 
This file is guarded by CONFIG_DEBUG_SECCOMP_CACHE with a default of N because I think certain users of seccomp might not want the application to know which syscalls are definitely usable. For the same reason, it is also guarded by CAP_SYS_ADMIN. Suggested-by: Jann Horn <jannh@google.com> Link: https://lore.kernel.org/lkml/CAG48ez3Ofqp4crXGksLmZY6=fGrF_tWyUCg7PBkAetvbbOPeOA@mail.gmail.com/ Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> --- arch/Kconfig | 15 +++++++++++ arch/x86/include/asm/seccomp.h | 3 +++ fs/proc/base.c | 3 +++ include/linux/seccomp.h | 5 ++++ kernel/seccomp.c | 46 ++++++++++++++++++++++++++++++++++ 5 files changed, 72 insertions(+) diff --git a/arch/Kconfig b/arch/Kconfig index ca867b2a5d71..b840cadcc882 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -478,6 +478,7 @@ config HAVE_ARCH_SECCOMP_CACHE_NR_ONLY - all the requirements for HAVE_ARCH_SECCOMP_FILTER - SECCOMP_ARCH_DEFAULT - SECCOMP_ARCH_DEFAULT_NR + - SECCOMP_ARCH_DEFAULT_NAME config SECCOMP prompt "Enable seccomp to safely execute untrusted bytecode" @@ -532,6 +533,20 @@ config SECCOMP_CACHE_NR_ONLY endchoice +config DEBUG_SECCOMP_CACHE + bool "Show seccomp filter cache status in /proc/pid/seccomp_cache" + depends on SECCOMP_CACHE_NR_ONLY + depends on PROC_FS + help + This is enables /proc/pid/seccomp_cache interface to monitor + seccomp cache data. The file format is subject to change. Reading + the file requires CAP_SYS_ADMIN. + + This option is for debugging only. Enabling present the risk that + an adversary may be able to infer the seccomp filter logic. + + If unsure, say N. 
+ config HAVE_ARCH_STACKLEAK bool help diff --git a/arch/x86/include/asm/seccomp.h b/arch/x86/include/asm/seccomp.h index 7b3a58271656..33ccc074be7a 100644 --- a/arch/x86/include/asm/seccomp.h +++ b/arch/x86/include/asm/seccomp.h @@ -19,13 +19,16 @@ #ifdef CONFIG_X86_64 # define SECCOMP_ARCH_DEFAULT AUDIT_ARCH_X86_64 # define SECCOMP_ARCH_DEFAULT_NR NR_syscalls +# define SECCOMP_ARCH_DEFAULT_NAME "x86_64" # ifdef CONFIG_COMPAT # define SECCOMP_ARCH_COMPAT AUDIT_ARCH_I386 # define SECCOMP_ARCH_COMPAT_NR IA32_NR_syscalls +# define SECCOMP_ARCH_COMPAT_NAME "x86_32" # endif #else /* !CONFIG_X86_64 */ # define SECCOMP_ARCH_DEFAULT AUDIT_ARCH_I386 # define SECCOMP_ARCH_DEFAULT_NR NR_syscalls +# define SECCOMP_ARCH_COMPAT_NAME "x86_32" #endif #include <asm-generic/seccomp.h> diff --git a/fs/proc/base.c b/fs/proc/base.c index 617db4e0faa0..c60c5fce70fa 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -3258,6 +3258,9 @@ static const struct pid_entry tgid_base_stuff[] = { #ifdef CONFIG_PROC_PID_ARCH_STATUS ONE("arch_status", S_IRUGO, proc_pid_arch_status), #endif +#ifdef CONFIG_DEBUG_SECCOMP_CACHE + ONE("seccomp_cache", S_IRUSR, proc_pid_seccomp_cache), +#endif }; static int proc_tgid_base_readdir(struct file *file, struct dir_context *ctx) diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h index 02aef2844c38..c35430f5f553 100644 --- a/include/linux/seccomp.h +++ b/include/linux/seccomp.h @@ -121,4 +121,9 @@ static inline long seccomp_get_metadata(struct task_struct *task, return -EINVAL; } #endif /* CONFIG_SECCOMP_FILTER && CONFIG_CHECKPOINT_RESTORE */ + +#ifdef CONFIG_DEBUG_SECCOMP_CACHE +int proc_pid_seccomp_cache(struct seq_file *m, struct pid_namespace *ns, + struct pid *pid, struct task_struct *task); +#endif #endif /* _LINUX_SECCOMP_H */ diff --git a/kernel/seccomp.c b/kernel/seccomp.c index bed3b2a7f6c8..c5ca5e30281b 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -2297,3 +2297,49 @@ static int __init seccomp_sysctl_init(void) 
device_initcall(seccomp_sysctl_init) #endif /* CONFIG_SYSCTL */ + +#ifdef CONFIG_DEBUG_SECCOMP_CACHE +/* Currently CONFIG_DEBUG_SECCOMP_CACHE implies CONFIG_SECCOMP_CACHE_NR_ONLY */ +static void proc_pid_seccomp_cache_arch(struct seq_file *m, const char *name, + const void *bitmap, size_t bitmap_size) +{ + int nr; + + for (nr = 0; nr < bitmap_size; nr++) { + bool cached = test_bit(nr, bitmap); + char *status = cached ? "ALLOW" : "FILTER"; + + seq_printf(m, "%s %d %s\n", name, nr, status); + } +} + +int proc_pid_seccomp_cache(struct seq_file *m, struct pid_namespace *ns, + struct pid *pid, struct task_struct *task) +{ + struct seccomp_filter *f; + + /* + * We don't want some sandboxed process know what their seccomp + * filters consist of. + */ + if (!file_ns_capable(m->file, &init_user_ns, CAP_SYS_ADMIN)) + return -EACCES; + + f = READ_ONCE(task->seccomp.filter); + if (!f) + return 0; + +#ifdef SECCOMP_ARCH_DEFAULT + proc_pid_seccomp_cache_arch(m, SECCOMP_ARCH_DEFAULT_NAME, + f->cache.syscall_allow_default, + SECCOMP_ARCH_DEFAULT_NR); +#endif /* SECCOMP_ARCH_DEFAULT */ + +#ifdef SECCOMP_ARCH_COMPAT + proc_pid_seccomp_cache_arch(m, SECCOMP_ARCH_COMPAT_NAME, + f->cache.syscall_allow_compat, + SECCOMP_ARCH_COMPAT_NR); +#endif /* SECCOMP_ARCH_COMPAT */ + return 0; +} +#endif /* CONFIG_DEBUG_SECCOMP_CACHE */ -- 2.28.0 ^ permalink raw reply related [flat|nested] 149+ messages in thread
* Re: [PATCH v3 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-09-30 15:19 ` [PATCH v3 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu @ 2020-09-30 22:00 ` Jann Horn 2020-09-30 23:12 ` Kees Cook 2020-10-01 12:06 ` YiFei Zhu 2020-09-30 22:59 ` Kees Cook 1 sibling, 2 replies; 149+ messages in thread From: Jann Horn @ 2020-09-30 22:00 UTC (permalink / raw) To: YiFei Zhu Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Wed, Sep 30, 2020 at 5:20 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > Currently the kernel does not provide an infrastructure to translate > architecture numbers to a human-readable name. Translating syscall > numbers to syscall names is possible through FTRACE_SYSCALL > infrastructure but it does not provide support for compat syscalls. > > This will create a file for each PID as /proc/pid/seccomp_cache. > The file will be empty when no seccomp filters are loaded, or be > in the format of: > <arch name> <decimal syscall number> <ALLOW | FILTER> > where ALLOW means the cache is guaranteed to allow the syscall, > and filter means the cache will pass the syscall to the BPF filter. > > For the docker default profile on x86_64 it looks like: > x86_64 0 ALLOW > x86_64 1 ALLOW > x86_64 2 ALLOW > x86_64 3 ALLOW > [...] > x86_64 132 ALLOW > x86_64 133 ALLOW > x86_64 134 FILTER > x86_64 135 FILTER > x86_64 136 FILTER > x86_64 137 ALLOW > x86_64 138 ALLOW > x86_64 139 FILTER > x86_64 140 ALLOW > x86_64 141 ALLOW > [...] Oooh, neat! :) Thanks! 
> Suggested-by: Jann Horn <jannh@google.com> > Link: https://lore.kernel.org/lkml/CAG48ez3Ofqp4crXGksLmZY6=fGrF_tWyUCg7PBkAetvbbOPeOA@mail.gmail.com/ > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> > --- > arch/Kconfig | 15 +++++++++++ > arch/x86/include/asm/seccomp.h | 3 +++ > fs/proc/base.c | 3 +++ > include/linux/seccomp.h | 5 ++++ > kernel/seccomp.c | 46 ++++++++++++++++++++++++++++++++++ > 5 files changed, 72 insertions(+) > > diff --git a/arch/Kconfig b/arch/Kconfig > index ca867b2a5d71..b840cadcc882 100644 > --- a/arch/Kconfig > +++ b/arch/Kconfig > @@ -478,6 +478,7 @@ config HAVE_ARCH_SECCOMP_CACHE_NR_ONLY > - all the requirements for HAVE_ARCH_SECCOMP_FILTER > - SECCOMP_ARCH_DEFAULT > - SECCOMP_ARCH_DEFAULT_NR > + - SECCOMP_ARCH_DEFAULT_NAME > > config SECCOMP > prompt "Enable seccomp to safely execute untrusted bytecode" > @@ -532,6 +533,20 @@ config SECCOMP_CACHE_NR_ONLY > > endchoice > > +config DEBUG_SECCOMP_CACHE > + bool "Show seccomp filter cache status in /proc/pid/seccomp_cache" > + depends on SECCOMP_CACHE_NR_ONLY > + depends on PROC_FS > + help > + This is enables /proc/pid/seccomp_cache interface to monitor nit: s/is enables/enables/ > + seccomp cache data. The file format is subject to change. Reading > + the file requires CAP_SYS_ADMIN. > + > + This option is for debugging only. Enabling present the risk that > + an adversary may be able to infer the seccomp filter logic. > + > + If unsure, say N. > + [...] > diff --git a/kernel/seccomp.c b/kernel/seccomp.c [...] > +int proc_pid_seccomp_cache(struct seq_file *m, struct pid_namespace *ns, > + struct pid *pid, struct task_struct *task) > +{ > + struct seccomp_filter *f; > + > + /* > + * We don't want some sandboxed process know what their seccomp > + * filters consist of. 
> + */ > + if (!file_ns_capable(m->file, &init_user_ns, CAP_SYS_ADMIN)) > + return -EACCES; > + > + f = READ_ONCE(task->seccomp.filter); > + if (!f) > + return 0; Hmm, this won't work, because the task could be exiting, and seccomp filters are detached in release_task() (using seccomp_filter_release()). And at the moment, seccomp_filter_release() just locklessly NULLs out the tsk->seccomp.filter pointer and drops the reference. The locking here is kind of gross, but basically I think you can change this code to use lock_task_sighand() / unlock_task_sighand() (see the other examples in fs/proc/base.c), and bail out if lock_task_sighand() returns NULL. And in seccomp_filter_release(), add something like this: /* We are effectively holding the siglock by not having any sighand. */ WARN_ON(tsk->sighand != NULL); > +#ifdef SECCOMP_ARCH_DEFAULT > + proc_pid_seccomp_cache_arch(m, SECCOMP_ARCH_DEFAULT_NAME, > + f->cache.syscall_allow_default, > + SECCOMP_ARCH_DEFAULT_NR); > +#endif /* SECCOMP_ARCH_DEFAULT */ > + > +#ifdef SECCOMP_ARCH_COMPAT > + proc_pid_seccomp_cache_arch(m, SECCOMP_ARCH_COMPAT_NAME, > + f->cache.syscall_allow_compat, > + SECCOMP_ARCH_COMPAT_NR); > +#endif /* SECCOMP_ARCH_COMPAT */ > + return 0; > +} > +#endif /* CONFIG_DEBUG_SECCOMP_CACHE */ > -- > 2.28.0 > ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v3 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-09-30 22:00 ` Jann Horn @ 2020-09-30 23:12 ` Kees Cook 2020-10-01 12:06 ` YiFei Zhu 1 sibling, 0 replies; 149+ messages in thread From: Kees Cook @ 2020-09-30 23:12 UTC (permalink / raw) To: Jann Horn Cc: YiFei Zhu, Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Thu, Oct 01, 2020 at 12:00:46AM +0200, Jann Horn wrote: > On Wed, Sep 30, 2020 at 5:20 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > [...] > > diff --git a/kernel/seccomp.c b/kernel/seccomp.c > [...] > > +int proc_pid_seccomp_cache(struct seq_file *m, struct pid_namespace *ns, > > + struct pid *pid, struct task_struct *task) > > +{ > > + struct seccomp_filter *f; > > + > > + /* > > + * We don't want some sandboxed process know what their seccomp > > + * filters consist of. > > + */ > > + if (!file_ns_capable(m->file, &init_user_ns, CAP_SYS_ADMIN)) > > + return -EACCES; > > + > > + f = READ_ONCE(task->seccomp.filter); > > + if (!f) > > + return 0; > > Hmm, this won't work, because the task could be exiting, and seccomp > filters are detached in release_task() (using > seccomp_filter_release()). And at the moment, seccomp_filter_release() > just locklessly NULLs out the tsk->seccomp.filter pointer and drops > the reference. Oh nice catch. Yeah, this would only happen if it was the only filter remaining on a process with no children, etc. > > The locking here is kind of gross, but basically I think you can > change this code to use lock_task_sighand() / unlock_task_sighand() > (see the other examples in fs/proc/base.c), and bail out if > lock_task_sighand() returns NULL. 
And in seccomp_filter_release(), add > something like this: > > /* We are effectively holding the siglock by not having any sighand. */ > WARN_ON(tsk->sighand != NULL); Yeah, good idea. -- Kees Cook
* Re: [PATCH v3 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-09-30 22:00 ` Jann Horn 2020-09-30 23:12 ` Kees Cook @ 2020-10-01 12:06 ` YiFei Zhu 2020-10-01 16:05 ` Jann Horn 1 sibling, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-10-01 12:06 UTC (permalink / raw) To: Jann Horn Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Wed, Sep 30, 2020 at 5:01 PM Jann Horn <jannh@google.com> wrote: > Hmm, this won't work, because the task could be exiting, and seccomp > filters are detached in release_task() (using > seccomp_filter_release()). And at the moment, seccomp_filter_release() > just locklessly NULLs out the tsk->seccomp.filter pointer and drops > the reference. > > The locking here is kind of gross, but basically I think you can > change this code to use lock_task_sighand() / unlock_task_sighand() > (see the other examples in fs/proc/base.c), and bail out if > lock_task_sighand() returns NULL. And in seccomp_filter_release(), add > something like this: > > /* We are effectively holding the siglock by not having any sighand. */ > WARN_ON(tsk->sighand != NULL); Ah thanks. I was thinking about how tasks exit and get freed and that sort of stuff, and how this would race against them. The last time I worked with procfs there was some magic going on that I could not figure out, so I was thinking if some magic will stop the task_struct from being released, considering it's an argument here. I just looked at release_task and related functions; looks like it will, at the end, decrease the reference count of the task_struct. Does procfs increase the refcount while calling the procfs functions? 
Hence, in procfs functions one can rely on the task_struct still being a valid task_struct, but any direct effects of release_task may happen while the procfs functions are running? YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v3 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-10-01 12:06 ` YiFei Zhu @ 2020-10-01 16:05 ` Jann Horn 2020-10-01 16:18 ` YiFei Zhu 0 siblings, 1 reply; 149+ messages in thread From: Jann Horn @ 2020-10-01 16:05 UTC (permalink / raw) To: YiFei Zhu Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Thu, Oct 1, 2020 at 2:06 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > On Wed, Sep 30, 2020 at 5:01 PM Jann Horn <jannh@google.com> wrote: > > Hmm, this won't work, because the task could be exiting, and seccomp > > filters are detached in release_task() (using > > seccomp_filter_release()). And at the moment, seccomp_filter_release() > > just locklessly NULLs out the tsk->seccomp.filter pointer and drops > > the reference. > > > > The locking here is kind of gross, but basically I think you can > > change this code to use lock_task_sighand() / unlock_task_sighand() > > (see the other examples in fs/proc/base.c), and bail out if > > lock_task_sighand() returns NULL. And in seccomp_filter_release(), add > > something like this: > > > > /* We are effectively holding the siglock by not having any sighand. */ > > WARN_ON(tsk->sighand != NULL); > > Ah thanks. I was thinking about how tasks exit and get freed and that > sort of stuff, and how this would race against them. The last time I > worked with procfs there was some magic going on that I could not > figure out, so I was thinking if some magic will stop the task_struct > from being released, considering it's an argument here. > > I just looked at release_task and related functions; looks like it > will, at the end, decrease the reference count of the task_struct. 
> Does procfs increase the refcount while calling the procfs functions? > Hence, in procfs functions one can rely on the task_struct still being > a task_struct, but any direct effects of release_task may happen while > the procfs functions are running? Yeah. The ONE() entry you're adding to tgid_base_stuff is used to help instantiate a "struct inode" when someone looks up the path "/proc/$tgid/seccomp_cache"; then when that path is opened, a "struct file" is created that holds a reference to the inode; and while that file exists, your proc_pid_seccomp_cache() can be invoked. proc_pid_seccomp_cache() is invoked from proc_single_show() ("PROC_I(inode)->op.proc_show" is proc_pid_seccomp_cache), and proc_single_show() obtains a temporary reference to the task_struct using get_pid_task() on a "struct pid" and drops that reference afterwards with put_task_struct(). The "struct pid" is obtained from the "struct proc_inode", which is essentially a subclass of "struct inode". The "struct pid" is kept referenced until the inode goes away, via proc_pid_evict_inode(), called by proc_evict_inode(). By looking at put_task_struct() and its callees, you can figure out which parts of the "struct task" are kept alive by the reference to it. By the way, maybe it'd make sense to add this to tid_base_stuff as well? That should just be one extra line of code. Seccomp filters are technically per-thread, so it would make sense to have them visible in the per-thread subdirectories /proc/$pid/task/$tid/. ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v3 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-10-01 16:05 ` Jann Horn @ 2020-10-01 16:18 ` YiFei Zhu 0 siblings, 0 replies; 149+ messages in thread From: YiFei Zhu @ 2020-10-01 16:18 UTC (permalink / raw) To: Jann Horn Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Thu, Oct 1, 2020 at 11:05 AM Jann Horn <jannh@google.com> wrote: > Yeah. > > The ONE() entry you're adding to tgid_base_stuff is used to help > instantiate a "struct inode" when someone looks up the path > "/proc/$tgid/seccomp_cache"; then when that path is opened, a "struct > file" is created that holds a reference to the inode; and while that > file exists, your proc_pid_seccomp_cache() can be invoked. > > proc_pid_seccomp_cache() is invoked from proc_single_show() > ("PROC_I(inode)->op.proc_show" is proc_pid_seccomp_cache), and > proc_single_show() obtains a temporary reference to the task_struct > using get_pid_task() on a "struct pid" and drops that reference > afterwards with put_task_struct(). The "struct pid" is obtained from > the "struct proc_inode", which is essentially a subclass of "struct > inode". The "struct pid" is kept refererenced until the inode goes > away, via proc_pid_evict_inode(), called by proc_evict_inode(). > > By looking at put_task_struct() and its callees, you can figure out > which parts of the "struct task" are kept alive by the reference to > it. Ah I see. Thanks for the explanation. > By the way, maybe it'd make sense to add this to tid_base_stuff as > well? That should just be one extra line of code. Seccomp filters are > technically per-thread, so it would make sense to have them visible in > the per-thread subdirectories /proc/$pid/task/$tid/. Right. Will do. 
YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v3 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-09-30 15:19 ` [PATCH v3 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu 2020-09-30 22:00 ` Jann Horn @ 2020-09-30 22:59 ` Kees Cook 2020-09-30 23:08 ` Jann Horn 1 sibling, 1 reply; 149+ messages in thread From: Kees Cook @ 2020-09-30 22:59 UTC (permalink / raw) To: YiFei Zhu Cc: containers, YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Wed, Sep 30, 2020 at 10:19:16AM -0500, YiFei Zhu wrote: > From: YiFei Zhu <yifeifz2@illinois.edu> > > Currently the kernel does not provide an infrastructure to translate > architecture numbers to a human-readable name. Translating syscall > numbers to syscall names is possible through FTRACE_SYSCALL > infrastructure but it does not provide support for compat syscalls. > > This will create a file for each PID as /proc/pid/seccomp_cache. > The file will be empty when no seccomp filters are loaded, or be > in the format of: > <arch name> <decimal syscall number> <ALLOW | FILTER> > where ALLOW means the cache is guaranteed to allow the syscall, > and filter means the cache will pass the syscall to the BPF filter. > > For the docker default profile on x86_64 it looks like: > x86_64 0 ALLOW > x86_64 1 ALLOW > x86_64 2 ALLOW > x86_64 3 ALLOW > [...] > x86_64 132 ALLOW > x86_64 133 ALLOW > x86_64 134 FILTER > x86_64 135 FILTER > x86_64 136 FILTER > x86_64 137 ALLOW > x86_64 138 ALLOW > x86_64 139 FILTER > x86_64 140 ALLOW > x86_64 141 ALLOW > [...] > > This file is guarded by CONFIG_DEBUG_SECCOMP_CACHE with a default > of N because I think certain users of seccomp might not want the > application to know which syscalls are definitely usable. 
For > the same reason, it is also guarded by CAP_SYS_ADMIN. > > Suggested-by: Jann Horn <jannh@google.com> > Link: https://lore.kernel.org/lkml/CAG48ez3Ofqp4crXGksLmZY6=fGrF_tWyUCg7PBkAetvbbOPeOA@mail.gmail.com/ > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> > --- > arch/Kconfig | 15 +++++++++++ > arch/x86/include/asm/seccomp.h | 3 +++ > fs/proc/base.c | 3 +++ > include/linux/seccomp.h | 5 ++++ > kernel/seccomp.c | 46 ++++++++++++++++++++++++++++++++++ > 5 files changed, 72 insertions(+) > > diff --git a/arch/Kconfig b/arch/Kconfig > index ca867b2a5d71..b840cadcc882 100644 > --- a/arch/Kconfig > +++ b/arch/Kconfig > @@ -478,6 +478,7 @@ config HAVE_ARCH_SECCOMP_CACHE_NR_ONLY > - all the requirements for HAVE_ARCH_SECCOMP_FILTER > - SECCOMP_ARCH_DEFAULT > - SECCOMP_ARCH_DEFAULT_NR > + - SECCOMP_ARCH_DEFAULT_NAME > > config SECCOMP > prompt "Enable seccomp to safely execute untrusted bytecode" > @@ -532,6 +533,20 @@ config SECCOMP_CACHE_NR_ONLY > > endchoice > > +config DEBUG_SECCOMP_CACHE naming nit: I prefer where what how order, so SECCOMP_CACHE_DEBUG. > + bool "Show seccomp filter cache status in /proc/pid/seccomp_cache" > + depends on SECCOMP_CACHE_NR_ONLY > + depends on PROC_FS > + help > + This enables the /proc/pid/seccomp_cache interface to monitor > + seccomp cache data. The file format is subject to change. Reading > + the file requires CAP_SYS_ADMIN. > + > + This option is for debugging only. Enabling it presents the risk that > + an adversary may be able to infer the seccomp filter logic. > + > + If unsure, say N. 
> + > config HAVE_ARCH_STACKLEAK > bool > help > diff --git a/arch/x86/include/asm/seccomp.h b/arch/x86/include/asm/seccomp.h > index 7b3a58271656..33ccc074be7a 100644 > --- a/arch/x86/include/asm/seccomp.h > +++ b/arch/x86/include/asm/seccomp.h > @@ -19,13 +19,16 @@ > #ifdef CONFIG_X86_64 > # define SECCOMP_ARCH_DEFAULT AUDIT_ARCH_X86_64 > # define SECCOMP_ARCH_DEFAULT_NR NR_syscalls > +# define SECCOMP_ARCH_DEFAULT_NAME "x86_64" > # ifdef CONFIG_COMPAT > # define SECCOMP_ARCH_COMPAT AUDIT_ARCH_I386 > # define SECCOMP_ARCH_COMPAT_NR IA32_NR_syscalls > +# define SECCOMP_ARCH_COMPAT_NAME "x86_32" I think this should be "ia32"? Is there a good definitive guide on this naming convention? -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v3 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-09-30 22:59 ` Kees Cook @ 2020-09-30 23:08 ` Jann Horn 2020-09-30 23:21 ` Kees Cook 0 siblings, 1 reply; 149+ messages in thread From: Jann Horn @ 2020-09-30 23:08 UTC (permalink / raw) To: Kees Cook Cc: YiFei Zhu, Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry, Thomas Gleixner, Ingo Molnar, Borislav Petkov, the arch/x86 maintainers [adding x86 folks to enhance bikeshedding] On Thu, Oct 1, 2020 at 12:59 AM Kees Cook <keescook@chromium.org> wrote: > On Wed, Sep 30, 2020 at 10:19:16AM -0500, YiFei Zhu wrote: > > From: YiFei Zhu <yifeifz2@illinois.edu> > > > > Currently the kernel does not provide an infrastructure to translate > > architecture numbers to a human-readable name. Translating syscall > > numbers to syscall names is possible through FTRACE_SYSCALL > > infrastructure but it does not provide support for compat syscalls. > > > > This will create a file for each PID as /proc/pid/seccomp_cache. > > The file will be empty when no seccomp filters are loaded, or be > > in the format of: > > <arch name> <decimal syscall number> <ALLOW | FILTER> > > where ALLOW means the cache is guaranteed to allow the syscall, > > and filter means the cache will pass the syscall to the BPF filter. > > > > For the docker default profile on x86_64 it looks like: > > x86_64 0 ALLOW > > x86_64 1 ALLOW > > x86_64 2 ALLOW > > x86_64 3 ALLOW > > [...] > > x86_64 132 ALLOW > > x86_64 133 ALLOW > > x86_64 134 FILTER > > x86_64 135 FILTER > > x86_64 136 FILTER > > x86_64 137 ALLOW > > x86_64 138 ALLOW > > x86_64 139 FILTER > > x86_64 140 ALLOW > > x86_64 141 ALLOW [...] 
> > diff --git a/arch/x86/include/asm/seccomp.h b/arch/x86/include/asm/seccomp.h > > index 7b3a58271656..33ccc074be7a 100644 > > --- a/arch/x86/include/asm/seccomp.h > > +++ b/arch/x86/include/asm/seccomp.h > > @@ -19,13 +19,16 @@ > > #ifdef CONFIG_X86_64 > > # define SECCOMP_ARCH_DEFAULT AUDIT_ARCH_X86_64 > > # define SECCOMP_ARCH_DEFAULT_NR NR_syscalls > > +# define SECCOMP_ARCH_DEFAULT_NAME "x86_64" > > # ifdef CONFIG_COMPAT > > # define SECCOMP_ARCH_COMPAT AUDIT_ARCH_I386 > > # define SECCOMP_ARCH_COMPAT_NR IA32_NR_syscalls > > +# define SECCOMP_ARCH_COMPAT_NAME "x86_32" > > I think this should be "ia32"? Is there a good definitive guide on this > naming convention? "man 2 syscall" calls them "x86-64" and "i386". The syscall table files use ABI names "i386" and "64". The syscall stub prefixes use "x64" and "ia32". I don't think we have a good consistent naming strategy here. :P ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v3 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-09-30 23:08 ` Jann Horn @ 2020-09-30 23:21 ` Kees Cook 0 siblings, 0 replies; 149+ messages in thread From: Kees Cook @ 2020-09-30 23:21 UTC (permalink / raw) To: Jann Horn Cc: YiFei Zhu, Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry, Thomas Gleixner, Ingo Molnar, Borislav Petkov, the arch/x86 maintainers On Thu, Oct 01, 2020 at 01:08:04AM +0200, Jann Horn wrote: > [adding x86 folks to enhance bikeshedding] > > On Thu, Oct 1, 2020 at 12:59 AM Kees Cook <keescook@chromium.org> wrote: > > On Wed, Sep 30, 2020 at 10:19:16AM -0500, YiFei Zhu wrote: > > > From: YiFei Zhu <yifeifz2@illinois.edu> > > > > > > Currently the kernel does not provide an infrastructure to translate > > > architecture numbers to a human-readable name. Translating syscall > > > numbers to syscall names is possible through FTRACE_SYSCALL > > > infrastructure but it does not provide support for compat syscalls. > > > > > > This will create a file for each PID as /proc/pid/seccomp_cache. > > > The file will be empty when no seccomp filters are loaded, or be > > > in the format of: > > > <arch name> <decimal syscall number> <ALLOW | FILTER> > > > where ALLOW means the cache is guaranteed to allow the syscall, > > > and filter means the cache will pass the syscall to the BPF filter. > > > > > > For the docker default profile on x86_64 it looks like: > > > x86_64 0 ALLOW > > > x86_64 1 ALLOW > > > x86_64 2 ALLOW > > > x86_64 3 ALLOW > > > [...] 
> > > x86_64 132 ALLOW > > > x86_64 133 ALLOW > > > x86_64 134 FILTER > > > x86_64 135 FILTER > > > x86_64 136 FILTER > > > x86_64 137 ALLOW > > > x86_64 138 ALLOW > > > x86_64 139 FILTER > > > x86_64 140 ALLOW > > > x86_64 141 ALLOW > [...] > > > diff --git a/arch/x86/include/asm/seccomp.h b/arch/x86/include/asm/seccomp.h > > > index 7b3a58271656..33ccc074be7a 100644 > > > --- a/arch/x86/include/asm/seccomp.h > > > +++ b/arch/x86/include/asm/seccomp.h > > > @@ -19,13 +19,16 @@ > > > #ifdef CONFIG_X86_64 > > > # define SECCOMP_ARCH_DEFAULT AUDIT_ARCH_X86_64 > > > # define SECCOMP_ARCH_DEFAULT_NR NR_syscalls > > > +# define SECCOMP_ARCH_DEFAULT_NAME "x86_64" > > > # ifdef CONFIG_COMPAT > > > # define SECCOMP_ARCH_COMPAT AUDIT_ARCH_I386 > > > # define SECCOMP_ARCH_COMPAT_NR IA32_NR_syscalls > > > +# define SECCOMP_ARCH_COMPAT_NAME "x86_32" > > > > I think this should be "ia32"? Is there a good definitive guide on this > > naming convention? > > "man 2 syscall" calls them "x86-64" and "i386". The syscall table > files use ABI names "i386" and "64". The syscall stub prefixes use > "x64" and "ia32". > > I don't think we have a good consistent naming strategy here. :P Agreed. And with "i386" being so hopelessly inaccurate, I prefer "ia32" ... *shrug* I would hope we don't have to be super-pedantic and call them "x86-64" and "IA-32". :P -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
* [PATCH v4 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results 2020-09-30 15:19 ` [PATCH v3 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu ` (4 preceding siblings ...) 2020-09-30 15:19 ` [PATCH v3 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu @ 2020-10-09 17:14 ` YiFei Zhu 2020-10-09 17:14 ` [PATCH v4 seccomp 1/5] seccomp/cache: Lookup syscall allowlist bitmap for fast path YiFei Zhu ` (5 more replies) 5 siblings, 6 replies; 149+ messages in thread From: YiFei Zhu @ 2020-10-09 17:14 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: YiFei Zhu <yifeifz2@illinois.edu> Alternative: https://lore.kernel.org/lkml/20200923232923.3142503-1-keescook@chromium.org/T/ Major differences from the linked alternative by Kees: * No x32 special-case handling -- not worth the complexity * No caching of denylist -- not worth the complexity * No seccomp arch pinning -- I think this is an independent feature * The bitmaps are part of the filters rather than the task. This series adds a bitmap to cache seccomp filter results if the result permits a syscall and is independent of syscall arguments. This visibly decreases seccomp overhead for most common seccomp filters with very little memory footprint. The overhead of running Seccomp filters has been part of some past discussions [1][2][3]. Oftentimes, the filters have a large number of instructions that check syscall numbers one by one and jump based on that. Some users chain BPF filters which further enlarge the overhead. 
A recent work [6] comprehensively measures the Seccomp overhead and shows that the overhead is non-negligible and has a non-trivial impact on application performance. We observed some common filters, such as docker's [4] or systemd's [5], will make most decisions based only on the syscall numbers, and as past discussions considered, a bitmap where each bit represents a syscall makes most sense for these filters. In order to build this bitmap at filter attach time, each filter is emulated for every syscall (under each possible architecture), and checked for any accesses of struct seccomp_data that are not the "arch" nor "nr" (syscall) members. If only "arch" and "nr" are examined, and the program returns allow, then we can be sure that the filter must return allow independent from syscall arguments. When it is concluded that an allow must occur for the given architecture and syscall pair, seccomp will immediately allow the syscall, bypassing further BPF execution. Ongoing work is to further support arguments with fast hash table lookups. We are investigating the performance of doing so [6], and how to best integrate with the existing seccomp infrastructure. Some benchmarks are performed with results in patch 5, copied below: Current BPF sysctl settings: net.core.bpf_jit_enable = 1 net.core.bpf_jit_harden = 0 Benchmarking 200000000 syscalls... 
129.359381409 - 0.008724424 = 129350656985 (129.4s) getpid native: 646 ns 264.385890006 - 129.360453229 = 135025436777 (135.0s) getpid RET_ALLOW 1 filter (bitmap): 675 ns 399.400511893 - 264.387045901 = 135013465992 (135.0s) getpid RET_ALLOW 2 filters (bitmap): 675 ns 545.872866260 - 399.401718327 = 146471147933 (146.5s) getpid RET_ALLOW 3 filters (full): 732 ns 696.337101319 - 545.874097681 = 150463003638 (150.5s) getpid RET_ALLOW 4 filters (full): 752 ns Estimated total seccomp overhead for 1 bitmapped filter: 29 ns Estimated total seccomp overhead for 2 bitmapped filters: 29 ns Estimated total seccomp overhead for 3 full filters: 86 ns Estimated total seccomp overhead for 4 full filters: 106 ns Estimated seccomp entry overhead: 29 ns Estimated seccomp per-filter overhead (last 2 diff): 20 ns Estimated seccomp per-filter overhead (filters / 4): 19 ns Expectations: native ≤ 1 bitmap (646 ≤ 675): ✔️ native ≤ 1 filter (646 ≤ 732): ✔️ per-filter (last 2 diff) ≈ per-filter (filters / 4) (20 ≈ 19): ✔️ 1 bitmapped ≈ 2 bitmapped (29 ≈ 29): ✔️ entry ≈ 1 bitmapped (29 ≈ 29): ✔️ entry ≈ 2 bitmapped (29 ≈ 29): ✔️ native + entry + (per filter * 4) ≈ 4 filters total (755 ≈ 752): ✔️ v3 -> v4: * Reordered patches * Naming changes * Fixed racing in /proc/pid/seccomp_cache against filter being released from task, using Jann's suggestion of sighand spinlock. * Cache no longer configurable. * Copied some description from cover letter to commit messages. * Used Kees's logic to set clear bits from bitmap, rather than set bits. v2 -> v3: * Added array_index_nospec guards * No more syscall_arches[] array and expecting on loop unrolling. Arches are configured with per-arch seccomp.h. * Moved filter emulation to attach time (from prepare time). * Further simplified emulator, basing on Kees's code. * Guard /proc/pid/seccomp_cache with CAP_SYS_ADMIN. v1 -> v2: * Corrected one outdated function documentation. RFC -> v1: * Config made on by default across all arches that could support it. 
* Added arch numbers array and emulate filter for each arch number, and have a per-arch bitmap. * Massively simplified the emulator so it would only support the common instructions in Kees's list. * Fixed inheriting bitmap across filters (filter->prev is always NULL during prepare). * Stole the selftest from Kees. * Added a /proc/pid/seccomp_cache by Jann's suggestion. Patch 1 implements the test_bit against the bitmaps. Patch 2 implements the emulator that finds if a filter must return allow, Patch 3 adds the arch macros for x86. Patch 4 updates the selftest to better show the new semantics. Patch 5 implements /proc/pid/seccomp_cache. [1] https://lore.kernel.org/linux-security-module/c22a6c3cefc2412cad00ae14c1371711@huawei.com/T/ [2] https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/T/ [3] https://github.com/seccomp/libseccomp/issues/116 [4] https://github.com/moby/moby/blob/ae0ef82b90356ac613f329a8ef5ee42ca923417d/profiles/seccomp/default.json [5] https://github.com/systemd/systemd/blob/6743a1caf4037f03dc51a1277855018e4ab61957/src/shared/seccomp-util.c#L270 [6] Draco: Architectural and Operating System Support for System Call Security https://tianyin.github.io/pub/draco.pdf, MICRO-53, Oct. 2020 Kees Cook (2): x86: Enable seccomp architecture tracking selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu (3): seccomp/cache: Lookup syscall allowlist bitmap for fast path seccomp/cache: Add "emulator" to check if filter is constant allow seccomp/cache: Report cache data through /proc/pid/seccomp_cache arch/Kconfig | 24 ++ arch/x86/Kconfig | 1 + arch/x86/include/asm/seccomp.h | 15 + fs/proc/base.c | 6 + include/linux/seccomp.h | 5 + kernel/seccomp.c | 289 +++++++++++++++++- .../selftests/seccomp/seccomp_benchmark.c | 151 +++++++-- tools/testing/selftests/seccomp/settings | 2 +- 8 files changed, 469 insertions(+), 24 deletions(-) -- 2.28.0 ^ permalink raw reply [flat|nested] 149+ messages in thread
* [PATCH v4 seccomp 1/5] seccomp/cache: Lookup syscall allowlist bitmap for fast path 2020-10-09 17:14 ` [PATCH v4 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu @ 2020-10-09 17:14 ` YiFei Zhu 2020-10-09 21:30 ` Jann Horn 2020-10-09 23:18 ` Kees Cook 2020-10-09 17:14 ` [PATCH v4 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow YiFei Zhu ` (4 subsequent siblings) 5 siblings, 2 replies; 149+ messages in thread From: YiFei Zhu @ 2020-10-09 17:14 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: YiFei Zhu <yifeifz2@illinois.edu> The overhead of running Seccomp filters has been part of some past discussions [1][2][3]. Oftentimes, the filters have a large number of instructions that check syscall numbers one by one and jump based on that. Some users chain BPF filters which further enlarge the overhead. A recent work [6] comprehensively measures the Seccomp overhead and shows that the overhead is non-negligible and has a non-trivial impact on application performance. We observed some common filters, such as docker's [4] or systemd's [5], will make most decisions based only on the syscall numbers, and as past discussions considered, a bitmap where each bit represents a syscall makes most sense for these filters. The fast (common) path for seccomp should be that the filter permits the syscall to pass through, and failing seccomp is expected to be an exceptional case; it is not expected for userspace to call a denylisted syscall over and over. 
When it can be concluded that an allow must occur for the given architecture and syscall pair (this determination is introduced in the next commit), seccomp will immediately allow the syscall, bypassing further BPF execution. Each architecture number has its own bitmap. The architecture number in seccomp_data is checked against the defined architecture number constant before proceeding to test the bit against the bitmap with the syscall number as the index of the bit in the bitmap, and if the bit is set, seccomp returns allow. The bitmaps are all clear in this patch and will be initialized in the next commit. [1] https://lore.kernel.org/linux-security-module/c22a6c3cefc2412cad00ae14c1371711@huawei.com/T/ [2] https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/T/ [3] https://github.com/seccomp/libseccomp/issues/116 [4] https://github.com/moby/moby/blob/ae0ef82b90356ac613f329a8ef5ee42ca923417d/profiles/seccomp/default.json [5] https://github.com/systemd/systemd/blob/6743a1caf4037f03dc51a1277855018e4ab61957/src/shared/seccomp-util.c#L270 [6] Draco: Architectural and Operating System Support for System Call Security https://tianyin.github.io/pub/draco.pdf, MICRO-53, Oct. 2020 Co-developed-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu> Signed-off-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu> Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> --- kernel/seccomp.c | 72 ++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 72 insertions(+) diff --git a/kernel/seccomp.c b/kernel/seccomp.c index ae6b40cc39f4..73f6b6e9a3b0 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -143,6 +143,34 @@ struct notification { struct list_head notifications; }; +#ifdef SECCOMP_ARCH_NATIVE +/** + * struct action_cache - per-filter cache of seccomp actions per + * arch/syscall pair + * + * @allow_native: A bitmap where each bit represents whether the + * filter will always allow the syscall, for the + * native architecture. 
+ * @allow_compat: A bitmap where each bit represents whether the + * filter will always allow the syscall, for the + * compat architecture. + */ +struct action_cache { + DECLARE_BITMAP(allow_native, SECCOMP_ARCH_NATIVE_NR); +#ifdef SECCOMP_ARCH_COMPAT + DECLARE_BITMAP(allow_compat, SECCOMP_ARCH_COMPAT_NR); +#endif +}; +#else +struct action_cache { }; + +static inline bool seccomp_cache_check_allow(const struct seccomp_filter *sfilter, + const struct seccomp_data *sd) +{ + return false; +} +#endif /* SECCOMP_ARCH_NATIVE */ + /** * struct seccomp_filter - container for seccomp BPF programs * @@ -298,6 +326,47 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen) return 0; } +#ifdef SECCOMP_ARCH_NATIVE +static inline bool seccomp_cache_check_allow_bitmap(const void *bitmap, + size_t bitmap_size, + int syscall_nr) +{ + if (unlikely(syscall_nr < 0 || syscall_nr >= bitmap_size)) + return false; + syscall_nr = array_index_nospec(syscall_nr, bitmap_size); + + return test_bit(syscall_nr, bitmap); +} + +/** + * seccomp_cache_check_allow - lookup seccomp cache + * @sfilter: The seccomp filter + * @sd: The seccomp data to lookup the cache with + * + * Returns true if the seccomp_data is cached and allowed. 
+ */ +static inline bool seccomp_cache_check_allow(const struct seccomp_filter *sfilter, + const struct seccomp_data *sd) +{ + int syscall_nr = sd->nr; + const struct action_cache *cache = &sfilter->cache; + + if (likely(sd->arch == SECCOMP_ARCH_NATIVE)) + return seccomp_cache_check_allow_bitmap(cache->allow_native, + SECCOMP_ARCH_NATIVE_NR, + syscall_nr); +#ifdef SECCOMP_ARCH_COMPAT + if (likely(sd->arch == SECCOMP_ARCH_COMPAT)) + return seccomp_cache_check_allow_bitmap(cache->allow_compat, + SECCOMP_ARCH_COMPAT_NR, + syscall_nr); +#endif /* SECCOMP_ARCH_COMPAT */ + + WARN_ON_ONCE(true); + return false; +} +#endif /* SECCOMP_ARCH_NATIVE */ + /** * seccomp_run_filters - evaluates all seccomp filters against @sd * @sd: optional seccomp data to be passed to filters @@ -320,6 +389,9 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd, if (WARN_ON(f == NULL)) return SECCOMP_RET_KILL_PROCESS; + if (seccomp_cache_check_allow(f, sd)) + return SECCOMP_RET_ALLOW; + /* * All filters in the list are evaluated and the lowest BPF return * value always takes priority (ignoring the DATA). -- 2.28.0 ^ permalink raw reply related [flat|nested] 149+ messages in thread
* Re: [PATCH v4 seccomp 1/5] seccomp/cache: Lookup syscall allowlist bitmap for fast path 2020-10-09 17:14 ` [PATCH v4 seccomp 1/5] seccomp/cache: Lookup syscall allowlist bitmap for fast path YiFei Zhu @ 2020-10-09 21:30 ` Jann Horn 2020-10-09 23:18 ` Kees Cook 1 sibling, 0 replies; 149+ messages in thread From: Jann Horn @ 2020-10-09 21:30 UTC (permalink / raw) To: YiFei Zhu Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Fri, Oct 9, 2020 at 7:15 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > The overhead of running Seccomp filters has been part of some past > discussions [1][2][3]. Oftentimes, the filters have a large number > of instructions that check syscall numbers one by one and jump based > on that. Some users chain BPF filters which further enlarge the > overhead. A recent work [6] comprehensively measures the Seccomp > overhead and shows that the overhead is non-negligible and has a > non-trivial impact on application performance. > > We observed some common filters, such as docker's [4] or > systemd's [5], will make most decisions based only on the syscall > numbers, and as past discussions considered, a bitmap where each bit > represents a syscall makes most sense for these filters. > > The fast (common) path for seccomp should be that the filter permits > the syscall to pass through, and failing seccomp is expected to be > an exceptional case; it is not expected for userspace to call a > denylisted syscall over and over. > > When it can be concluded that an allow must occur for the given > architecture and syscall pair (this determination is introduced in > the next commit), seccomp will immediately allow the syscall, > bypassing further BPF execution. > > Each architecture number has its own bitmap. 
The architecture > number in seccomp_data is checked against the defined architecture > number constant before proceeding to test the bit against the > bitmap with the syscall number as the index of the bit in the > bitmap, and if the bit is set, seccomp returns allow. The bitmaps > are all clear in this patch and will be initialized in the next > commit. [...] > Co-developed-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu> > Signed-off-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu> > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> Reviewed-by: Jann Horn <jannh@google.com> ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v4 seccomp 1/5] seccomp/cache: Lookup syscall allowlist bitmap for fast path 2020-10-09 17:14 ` [PATCH v4 seccomp 1/5] seccomp/cache: Lookup syscall allowlist bitmap for fast path YiFei Zhu 2020-10-09 21:30 ` Jann Horn @ 2020-10-09 23:18 ` Kees Cook 1 sibling, 0 replies; 149+ messages in thread From: Kees Cook @ 2020-10-09 23:18 UTC (permalink / raw) To: YiFei Zhu Cc: containers, YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Fri, Oct 09, 2020 at 12:14:29PM -0500, YiFei Zhu wrote: > From: YiFei Zhu <yifeifz2@illinois.edu> > > The overhead of running Seccomp filters has been part of some past > discussions [1][2][3]. Oftentimes, the filters have a large number > of instructions that check syscall numbers one by one and jump based > on that. Some users chain BPF filters which further enlarge the > overhead. A recent work [6] comprehensively measures the Seccomp > overhead and shows that the overhead is non-negligible and has a > non-trivial impact on application performance. > > We observed some common filters, such as docker's [4] or > systemd's [5], will make most decisions based only on the syscall > numbers, and as past discussions considered, a bitmap where each bit > represents a syscall makes most sense for these filters. > > The fast (common) path for seccomp should be that the filter permits > the syscall to pass through, and failing seccomp is expected to be > an exceptional case; it is not expected for userspace to call a > denylisted syscall over and over. > > When it can be concluded that an allow must occur for the given > architecture and syscall pair (this determination is introduced in > the next commit), seccomp will immediately allow the syscall, > bypassing further BPF execution. 
> > Each architecture number has its own bitmap. The architecture > number in seccomp_data is checked against the defined architecture > number constant before proceeding to test the bit against the > bitmap with the syscall number as the index of the bit in the > bitmap, and if the bit is set, seccomp returns allow. The bitmaps > are all clear in this patch and will be initialized in the next > commit. > > [1] https://lore.kernel.org/linux-security-module/c22a6c3cefc2412cad00ae14c1371711@huawei.com/T/ > [2] https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/T/ > [3] https://github.com/seccomp/libseccomp/issues/116 > [4] https://github.com/moby/moby/blob/ae0ef82b90356ac613f329a8ef5ee42ca923417d/profiles/seccomp/default.json > [5] https://github.com/systemd/systemd/blob/6743a1caf4037f03dc51a1277855018e4ab61957/src/shared/seccomp-util.c#L270 > [6] Draco: Architectural and Operating System Support for System Call Security > https://tianyin.github.io/pub/draco.pdf, MICRO-53, Oct. 2020 > > Co-developed-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu> > Signed-off-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu> > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> > --- > kernel/seccomp.c | 72 ++++++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 72 insertions(+) > > diff --git a/kernel/seccomp.c b/kernel/seccomp.c > index ae6b40cc39f4..73f6b6e9a3b0 100644 > --- a/kernel/seccomp.c > +++ b/kernel/seccomp.c > @@ -143,6 +143,34 @@ struct notification { > struct list_head notifications; > }; > > +#ifdef SECCOMP_ARCH_NATIVE > +/** > + * struct action_cache - per-filter cache of seccomp actions per > + * arch/syscall pair > + * > + * @allow_native: A bitmap where each bit represents whether the > + * filter will always allow the syscall, for the > + * native architecture. > + * @allow_compat: A bitmap where each bit represents whether the > + * filter will always allow the syscall, for the > + * compat architecture. 
> + */ > +struct action_cache { > + DECLARE_BITMAP(allow_native, SECCOMP_ARCH_NATIVE_NR); > +#ifdef SECCOMP_ARCH_COMPAT > + DECLARE_BITMAP(allow_compat, SECCOMP_ARCH_COMPAT_NR); > +#endif > +}; > +#else > +struct action_cache { }; > + > +static inline bool seccomp_cache_check_allow(const struct seccomp_filter *sfilter, > + const struct seccomp_data *sd) > +{ > + return false; > +} > +#endif /* SECCOMP_ARCH_NATIVE */ > + > /** > * struct seccomp_filter - container for seccomp BPF programs > * > @@ -298,6 +326,47 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen) > return 0; > } > > +#ifdef SECCOMP_ARCH_NATIVE > +static inline bool seccomp_cache_check_allow_bitmap(const void *bitmap, > + size_t bitmap_size, > + int syscall_nr) > +{ > + if (unlikely(syscall_nr < 0 || syscall_nr >= bitmap_size)) > + return false; > + syscall_nr = array_index_nospec(syscall_nr, bitmap_size); > + > + return test_bit(syscall_nr, bitmap); > +} > + > +/** > + * seccomp_cache_check_allow - lookup seccomp cache > + * @sfilter: The seccomp filter > + * @sd: The seccomp data to lookup the cache with > + * > + * Returns true if the seccomp_data is cached and allowed. 
> + */ > +static inline bool seccomp_cache_check_allow(const struct seccomp_filter *sfilter, > + const struct seccomp_data *sd) > +{ > + int syscall_nr = sd->nr; > + const struct action_cache *cache = &sfilter->cache; > + > + if (likely(sd->arch == SECCOMP_ARCH_NATIVE)) > + return seccomp_cache_check_allow_bitmap(cache->allow_native, > + SECCOMP_ARCH_NATIVE_NR, > + syscall_nr); > +#ifdef SECCOMP_ARCH_COMPAT > + if (likely(sd->arch == SECCOMP_ARCH_COMPAT)) > + return seccomp_cache_check_allow_bitmap(cache->allow_compat, > + SECCOMP_ARCH_COMPAT_NR, > + syscall_nr); > +#endif /* SECCOMP_ARCH_COMPAT */ > + > + WARN_ON_ONCE(true); > + return false; > +} > +#endif /* SECCOMP_ARCH_NATIVE */ An small optimization for the non-compat case might be to do this to avoid the sd->arch test (which should have no way to ever change in such builds): static inline bool seccomp_cache_check_allow(const struct seccomp_filter *sfilter, const struct seccomp_data *sd) { const struct action_cache *cache = &sfilter->cache; #ifndef SECCOMP_ARCH_COMPAT /* A native-only architecture doesn't need to check sd->arch. 
*/ return seccomp_cache_check_allow_bitmap(cache->allow_native, SECCOMP_ARCH_NATIVE_NR, sd->nr); #else /* SECCOMP_ARCH_COMPAT */ if (likely(sd->arch == SECCOMP_ARCH_NATIVE)) return seccomp_cache_check_allow_bitmap(cache->allow_native, SECCOMP_ARCH_NATIVE_NR, sd->nr); if (likely(sd->arch == SECCOMP_ARCH_COMPAT)) return seccomp_cache_check_allow_bitmap(cache->allow_compat, SECCOMP_ARCH_COMPAT_NR, sd->nr); #endif WARN_ON_ONCE(true); return false; } > + > /** > * seccomp_run_filters - evaluates all seccomp filters against @sd > * @sd: optional seccomp data to be passed to filters > @@ -320,6 +389,9 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd, > if (WARN_ON(f == NULL)) > return SECCOMP_RET_KILL_PROCESS; > > + if (seccomp_cache_check_allow(f, sd)) > + return SECCOMP_RET_ALLOW; > + > /* > * All filters in the list are evaluated and the lowest BPF return > * value always takes priority (ignoring the DATA). > -- > 2.28.0 > This is all looking good; thank you! I'm doing some test builds/runs now. :) -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
* [PATCH v4 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow 2020-10-09 17:14 ` [PATCH v4 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu 2020-10-09 17:14 ` [PATCH v4 seccomp 1/5] seccomp/cache: Lookup syscall allowlist bitmap for fast path YiFei Zhu @ 2020-10-09 17:14 ` YiFei Zhu 2020-10-09 21:30 ` Jann Horn 2020-10-09 17:14 ` [PATCH v4 seccomp 3/5] x86: Enable seccomp architecture tracking YiFei Zhu ` (3 subsequent siblings) 5 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-10-09 17:14 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: YiFei Zhu <yifeifz2@illinois.edu> SECCOMP_CACHE will only operate on syscalls that do not access any syscall arguments or instruction pointer. To facilitate this we need a static analyser to know whether a filter will return allow regardless of syscall arguments for a given architecture number / syscall number pair. This is implemented here with a pseudo-emulator, and stored in a per-filter bitmap. In order to build this bitmap at filter attach time, each filter is emulated for every syscall (under each possible architecture), and checked for any accesses of struct seccomp_data that are not the "arch" nor "nr" (syscall) members. If only "arch" and "nr" are examined, and the program returns allow, then we can be sure that the filter must return allow independent from syscall arguments. Nearly all seccomp filters are built from these cBPF instructions: BPF_LD | BPF_W | BPF_ABS BPF_JMP | BPF_JEQ | BPF_K BPF_JMP | BPF_JGE | BPF_K BPF_JMP | BPF_JGT | BPF_K BPF_JMP | BPF_JSET | BPF_K BPF_JMP | BPF_JA BPF_RET | BPF_K BPF_ALU | BPF_AND | BPF_K Each of these instructions are emulated. 
Any weirdness or loading from a syscall argument will cause the emulator to bail. The emulation is also halted if it reaches a return. In that case, if it returns an SECCOMP_RET_ALLOW, the syscall is marked as good. Emulator structure and comments are from Kees [1] and Jann [2]. Emulation is done at attach time. If a filter depends on more filters, and if the dependee does not guarantee to allow the syscall, then we skip the emulation of this syscall. [1] https://lore.kernel.org/lkml/20200923232923.3142503-5-keescook@chromium.org/ [2] https://lore.kernel.org/lkml/CAG48ez1p=dR_2ikKq=xVxkoGg0fYpTBpkhJSv1w-6BG=76PAvw@mail.gmail.com/ Suggested-by: Jann Horn <jannh@google.com> Co-developed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> --- kernel/seccomp.c | 158 ++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 157 insertions(+), 1 deletion(-) diff --git a/kernel/seccomp.c b/kernel/seccomp.c index 73f6b6e9a3b0..51032b41fe59 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -169,6 +169,10 @@ static inline bool seccomp_cache_check_allow(const struct seccomp_filter *sfilte { return false; } + +static inline void seccomp_cache_prepare(struct seccomp_filter *sfilter) +{ +} #endif /* SECCOMP_ARCH_NATIVE */ /** @@ -187,6 +191,7 @@ static inline bool seccomp_cache_check_allow(const struct seccomp_filter *sfilte * this filter after reaching 0. The @users count is always smaller * or equal to @refs. Hence, reaching 0 for @users does not mean * the filter can be freed. 
+ * @cache: cache of arch/syscall mappings to actions * @log: true if all actions except for SECCOMP_RET_ALLOW should be logged * @prev: points to a previously installed, or inherited, filter * @prog: the BPF program to evaluate @@ -208,6 +213,7 @@ struct seccomp_filter { refcount_t refs; refcount_t users; bool log; + struct action_cache cache; struct seccomp_filter *prev; struct bpf_prog *prog; struct notification *notif; @@ -616,7 +622,12 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog) { struct seccomp_filter *sfilter; int ret; - const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE); + const bool save_orig = +#if defined(CONFIG_CHECKPOINT_RESTORE) || defined(SECCOMP_ARCH_NATIVE) + true; +#else + false; +#endif if (fprog->len == 0 || fprog->len > BPF_MAXINSNS) return ERR_PTR(-EINVAL); @@ -682,6 +693,150 @@ seccomp_prepare_user_filter(const char __user *user_filter) return filter; } +#ifdef SECCOMP_ARCH_NATIVE +/** + * seccomp_is_const_allow - check if filter is constant allow with given data + * @fprog: The BPF programs + * @sd: The seccomp data to check against, only syscall number are arch + * number are considered constant. 
+ */ +static bool seccomp_is_const_allow(struct sock_fprog_kern *fprog, + struct seccomp_data *sd) +{ + unsigned int insns; + unsigned int reg_value = 0; + unsigned int pc; + bool op_res; + + if (WARN_ON_ONCE(!fprog)) + return false; + + insns = bpf_classic_proglen(fprog); + for (pc = 0; pc < insns; pc++) { + struct sock_filter *insn = &fprog->filter[pc]; + u16 code = insn->code; + u32 k = insn->k; + + switch (code) { + case BPF_LD | BPF_W | BPF_ABS: + switch (k) { + case offsetof(struct seccomp_data, nr): + reg_value = sd->nr; + break; + case offsetof(struct seccomp_data, arch): + reg_value = sd->arch; + break; + default: + /* can't optimize (non-constant value load) */ + return false; + } + break; + case BPF_RET | BPF_K: + /* reached return with constant values only, check allow */ + return k == SECCOMP_RET_ALLOW; + case BPF_JMP | BPF_JA: + pc += insn->k; + break; + case BPF_JMP | BPF_JEQ | BPF_K: + case BPF_JMP | BPF_JGE | BPF_K: + case BPF_JMP | BPF_JGT | BPF_K: + case BPF_JMP | BPF_JSET | BPF_K: + switch (BPF_OP(code)) { + case BPF_JEQ: + op_res = reg_value == k; + break; + case BPF_JGE: + op_res = reg_value >= k; + break; + case BPF_JGT: + op_res = reg_value > k; + break; + case BPF_JSET: + op_res = !!(reg_value & k); + break; + default: + /* can't optimize (unknown jump) */ + return false; + } + + pc += op_res ? insn->jt : insn->jf; + break; + case BPF_ALU | BPF_AND | BPF_K: + reg_value &= k; + break; + default: + /* can't optimize (unknown insn) */ + return false; + } + } + + /* ran off the end of the filter?! */ + WARN_ON(1); + return false; +} + +static void seccomp_cache_prepare_bitmap(struct seccomp_filter *sfilter, + void *bitmap, const void *bitmap_prev, + size_t bitmap_size, int arch) +{ + struct sock_fprog_kern *fprog = sfilter->prog->orig_prog; + struct seccomp_data sd; + int nr; + + if (bitmap_prev) { + /* The new filter must be as restrictive as the last. 
*/ + bitmap_copy(bitmap, bitmap_prev, bitmap_size); + } else { + /* Before any filters, all syscalls are always allowed. */ + bitmap_fill(bitmap, bitmap_size); + } + + for (nr = 0; nr < bitmap_size; nr++) { + /* No bitmap change: not a cacheable action. */ + if (!test_bit(nr, bitmap)) + continue; + + sd.nr = nr; + sd.arch = arch; + + /* No bitmap change: continue to always allow. */ + if (seccomp_is_const_allow(fprog, &sd)) + continue; + + /* + * Not a cacheable action: always run filters. + * atomic clear_bit() not needed, filter not visible yet. + */ + __clear_bit(nr, bitmap); + } +} + +/** + * seccomp_cache_prepare - emulate the filter to find cachable syscalls + * @sfilter: The seccomp filter + * + * Returns 0 if successful or -errno if error occurred. + */ +static void seccomp_cache_prepare(struct seccomp_filter *sfilter) +{ + struct action_cache *cache = &sfilter->cache; + const struct action_cache *cache_prev = + sfilter->prev ? &sfilter->prev->cache : NULL; + + seccomp_cache_prepare_bitmap(sfilter, cache->allow_native, + cache_prev ? cache_prev->allow_native : NULL, + SECCOMP_ARCH_NATIVE_NR, + SECCOMP_ARCH_NATIVE); + +#ifdef SECCOMP_ARCH_COMPAT + seccomp_cache_prepare_bitmap(sfilter, cache->allow_compat, + cache_prev ? cache_prev->allow_compat : NULL, + SECCOMP_ARCH_COMPAT_NR, + SECCOMP_ARCH_COMPAT); +#endif /* SECCOMP_ARCH_COMPAT */ +} +#endif /* SECCOMP_ARCH_NATIVE */ + /** * seccomp_attach_filter: validate and attach filter * @flags: flags to change filter behavior @@ -731,6 +886,7 @@ static long seccomp_attach_filter(unsigned int flags, * task reference. */ filter->prev = current->seccomp.filter; + seccomp_cache_prepare(filter); current->seccomp.filter = filter; atomic_inc(¤t->seccomp.filter_count); -- 2.28.0 ^ permalink raw reply related [flat|nested] 149+ messages in thread
* Re: [PATCH v4 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow 2020-10-09 17:14 ` [PATCH v4 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow YiFei Zhu @ 2020-10-09 21:30 ` Jann Horn 2020-10-09 22:47 ` Kees Cook 0 siblings, 1 reply; 149+ messages in thread From: Jann Horn @ 2020-10-09 21:30 UTC (permalink / raw) To: YiFei Zhu Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Fri, Oct 9, 2020 at 7:15 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > > From: YiFei Zhu <yifeifz2@illinois.edu> > > SECCOMP_CACHE will only operate on syscalls that do not access > any syscall arguments or instruction pointer. To facilitate > this we need a static analyser to know whether a filter will > return allow regardless of syscall arguments for a given > architecture number / syscall number pair. This is implemented > here with a pseudo-emulator, and stored in a per-filter bitmap. > > In order to build this bitmap at filter attach time, each filter is > emulated for every syscall (under each possible architecture), and > checked for any accesses of struct seccomp_data that are not the "arch" > nor "nr" (syscall) members. If only "arch" and "nr" are examined, and > the program returns allow, then we can be sure that the filter must > return allow independent from syscall arguments. > > Nearly all seccomp filters are built from these cBPF instructions: > > BPF_LD | BPF_W | BPF_ABS > BPF_JMP | BPF_JEQ | BPF_K > BPF_JMP | BPF_JGE | BPF_K > BPF_JMP | BPF_JGT | BPF_K > BPF_JMP | BPF_JSET | BPF_K > BPF_JMP | BPF_JA > BPF_RET | BPF_K > BPF_ALU | BPF_AND | BPF_K > > Each of these instructions are emulated. Any weirdness or loading > from a syscall argument will cause the emulator to bail. 
> > The emulation is also halted if it reaches a return. In that case, > if it returns an SECCOMP_RET_ALLOW, the syscall is marked as good. > > Emulator structure and comments are from Kees [1] and Jann [2]. > > Emulation is done at attach time. If a filter depends on more > filters, and if the dependee does not guarantee to allow the > syscall, then we skip the emulation of this syscall. > > [1] https://lore.kernel.org/lkml/20200923232923.3142503-5-keescook@chromium.org/ > [2] https://lore.kernel.org/lkml/CAG48ez1p=dR_2ikKq=xVxkoGg0fYpTBpkhJSv1w-6BG=76PAvw@mail.gmail.com/ [...] > @@ -682,6 +693,150 @@ seccomp_prepare_user_filter(const char __user *user_filter) > return filter; > } > > +#ifdef SECCOMP_ARCH_NATIVE > +/** > + * seccomp_is_const_allow - check if filter is constant allow with given data > + * @fprog: The BPF programs > + * @sd: The seccomp data to check against, only syscall number are arch > + * number are considered constant. nit: s/syscall number are arch number/syscall number and arch number/ > + */ > +static bool seccomp_is_const_allow(struct sock_fprog_kern *fprog, > + struct seccomp_data *sd) > +{ > + unsigned int insns; > + unsigned int reg_value = 0; > + unsigned int pc; > + bool op_res; > + > + if (WARN_ON_ONCE(!fprog)) > + return false; > + > + insns = bpf_classic_proglen(fprog); bpf_classic_proglen() is defined as: #define bpf_classic_proglen(fprog) (fprog->len * sizeof(fprog->filter[0])) so this is wrong - what you want is the number of instructions in the program, what you actually have is the size of the program in bytes. Please instead check for `pc < fprog->len` in the loop condition. > + for (pc = 0; pc < insns; pc++) { > + struct sock_filter *insn = &fprog->filter[pc]; > + u16 code = insn->code; > + u32 k = insn->k; [...] > + } > + > + /* ran off the end of the filter?! */ > + WARN_ON(1); > + return false; > +} ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v4 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow 2020-10-09 21:30 ` Jann Horn @ 2020-10-09 22:47 ` Kees Cook 0 siblings, 0 replies; 149+ messages in thread From: Kees Cook @ 2020-10-09 22:47 UTC (permalink / raw) To: Jann Horn Cc: YiFei Zhu, Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Fri, Oct 09, 2020 at 11:30:18PM +0200, Jann Horn wrote: > On Fri, Oct 9, 2020 at 7:15 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > > > > From: YiFei Zhu <yifeifz2@illinois.edu> > > > > SECCOMP_CACHE will only operate on syscalls that do not access > > any syscall arguments or instruction pointer. To facilitate > > this we need a static analyser to know whether a filter will > > return allow regardless of syscall arguments for a given > > architecture number / syscall number pair. This is implemented > > here with a pseudo-emulator, and stored in a per-filter bitmap. > > > > In order to build this bitmap at filter attach time, each filter is > > emulated for every syscall (under each possible architecture), and > > checked for any accesses of struct seccomp_data that are not the "arch" > > nor "nr" (syscall) members. If only "arch" and "nr" are examined, and > > the program returns allow, then we can be sure that the filter must > > return allow independent from syscall arguments. > > > > Nearly all seccomp filters are built from these cBPF instructions: > > > > BPF_LD | BPF_W | BPF_ABS > > BPF_JMP | BPF_JEQ | BPF_K > > BPF_JMP | BPF_JGE | BPF_K > > BPF_JMP | BPF_JGT | BPF_K > > BPF_JMP | BPF_JSET | BPF_K > > BPF_JMP | BPF_JA > > BPF_RET | BPF_K > > BPF_ALU | BPF_AND | BPF_K > > > > Each of these instructions are emulated. 
Any weirdness or loading > > from a syscall argument will cause the emulator to bail. > > > > The emulation is also halted if it reaches a return. In that case, > > if it returns an SECCOMP_RET_ALLOW, the syscall is marked as good. > > > > Emulator structure and comments are from Kees [1] and Jann [2]. > > > > Emulation is done at attach time. If a filter depends on more > > filters, and if the dependee does not guarantee to allow the > > syscall, then we skip the emulation of this syscall. > > > > [1] https://lore.kernel.org/lkml/20200923232923.3142503-5-keescook@chromium.org/ > > [2] https://lore.kernel.org/lkml/CAG48ez1p=dR_2ikKq=xVxkoGg0fYpTBpkhJSv1w-6BG=76PAvw@mail.gmail.com/ > [...] > > @@ -682,6 +693,150 @@ seccomp_prepare_user_filter(const char __user *user_filter) > > return filter; > > } > > > > +#ifdef SECCOMP_ARCH_NATIVE > > +/** > > + * seccomp_is_const_allow - check if filter is constant allow with given data > > + * @fprog: The BPF programs > > + * @sd: The seccomp data to check against, only syscall number are arch > > + * number are considered constant. > > nit: s/syscall number are arch number/syscall number and arch number/ > > > + */ > > +static bool seccomp_is_const_allow(struct sock_fprog_kern *fprog, > > + struct seccomp_data *sd) > > +{ > > + unsigned int insns; > > + unsigned int reg_value = 0; > > + unsigned int pc; > > + bool op_res; > > + > > + if (WARN_ON_ONCE(!fprog)) > > + return false; > > + > > + insns = bpf_classic_proglen(fprog); > > bpf_classic_proglen() is defined as: > > #define bpf_classic_proglen(fprog) (fprog->len * sizeof(fprog->filter[0])) > > so this is wrong - what you want is the number of instructions in the > program, what you actually have is the size of the program in bytes. > Please instead check for `pc < fprog->len` in the loop condition. Oh yes, good catch. I had this wrong in my v1. -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
* [PATCH v4 seccomp 3/5] x86: Enable seccomp architecture tracking 2020-10-09 17:14 ` [PATCH v4 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu 2020-10-09 17:14 ` [PATCH v4 seccomp 1/5] seccomp/cache: Lookup syscall allowlist bitmap for fast path YiFei Zhu 2020-10-09 17:14 ` [PATCH v4 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow YiFei Zhu @ 2020-10-09 17:14 ` YiFei Zhu 2020-10-09 17:25 ` Andy Lutomirski 2020-10-09 17:14 ` [PATCH v4 seccomp 4/5] selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu ` (2 subsequent siblings) 5 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-10-09 17:14 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: Kees Cook <keescook@chromium.org> Provide seccomp internals with the details to calculate which syscall table the running kernel is expecting to deal with. This allows for efficient architecture pinning and paves the way for constant-action bitmaps. 
Signed-off-by: Kees Cook <keescook@chromium.org> Co-developed-by: YiFei Zhu <yifeifz2@illinois.edu> Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> --- arch/x86/include/asm/seccomp.h | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/arch/x86/include/asm/seccomp.h b/arch/x86/include/asm/seccomp.h index 2bd1338de236..03365af6165d 100644 --- a/arch/x86/include/asm/seccomp.h +++ b/arch/x86/include/asm/seccomp.h @@ -16,6 +16,18 @@ #define __NR_seccomp_sigreturn_32 __NR_ia32_sigreturn #endif +#ifdef CONFIG_X86_64 +# define SECCOMP_ARCH_NATIVE AUDIT_ARCH_X86_64 +# define SECCOMP_ARCH_NATIVE_NR NR_syscalls +# ifdef CONFIG_COMPAT +# define SECCOMP_ARCH_COMPAT AUDIT_ARCH_I386 +# define SECCOMP_ARCH_COMPAT_NR IA32_NR_syscalls +# endif +#else /* !CONFIG_X86_64 */ +# define SECCOMP_ARCH_NATIVE AUDIT_ARCH_I386 +# define SECCOMP_ARCH_NATIVE_NR NR_syscalls +#endif + #include <asm-generic/seccomp.h> #endif /* _ASM_X86_SECCOMP_H */ -- 2.28.0 ^ permalink raw reply related [flat|nested] 149+ messages in thread
* Re: [PATCH v4 seccomp 3/5] x86: Enable seccomp architecture tracking 2020-10-09 17:14 ` [PATCH v4 seccomp 3/5] x86: Enable seccomp architecture tracking YiFei Zhu @ 2020-10-09 17:25 ` Andy Lutomirski 2020-10-09 18:32 ` YiFei Zhu 0 siblings, 1 reply; 149+ messages in thread From: Andy Lutomirski @ 2020-10-09 17:25 UTC (permalink / raw) To: YiFei Zhu Cc: Linux Containers, YiFei Zhu, bpf, LKML, Aleksa Sarai, Andrea Arcangeli, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Fri, Oct 9, 2020 at 10:15 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > > From: Kees Cook <keescook@chromium.org> > > Provide seccomp internals with the details to calculate which syscall > table the running kernel is expecting to deal with. This allows for > efficient architecture pinning and paves the way for constant-action > bitmaps. > > Signed-off-by: Kees Cook <keescook@chromium.org> > Co-developed-by: YiFei Zhu <yifeifz2@illinois.edu> > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> > --- > arch/x86/include/asm/seccomp.h | 12 ++++++++++++ > 1 file changed, 12 insertions(+) > > diff --git a/arch/x86/include/asm/seccomp.h b/arch/x86/include/asm/seccomp.h > index 2bd1338de236..03365af6165d 100644 > --- a/arch/x86/include/asm/seccomp.h > +++ b/arch/x86/include/asm/seccomp.h > @@ -16,6 +16,18 @@ > #define __NR_seccomp_sigreturn_32 __NR_ia32_sigreturn > #endif > > +#ifdef CONFIG_X86_64 > +# define SECCOMP_ARCH_NATIVE AUDIT_ARCH_X86_64 > +# define SECCOMP_ARCH_NATIVE_NR NR_syscalls > +# ifdef CONFIG_COMPAT > +# define SECCOMP_ARCH_COMPAT AUDIT_ARCH_I386 > +# define SECCOMP_ARCH_COMPAT_NR IA32_NR_syscalls > +# endif > +#else /* !CONFIG_X86_64 */ > +# define SECCOMP_ARCH_NATIVE AUDIT_ARCH_I386 > +# define SECCOMP_ARCH_NATIVE_NR NR_syscalls > +#endif Is the idea that any syscall that's out of range for this (e.g. 
all of the x32 syscalls) is unoptimized? I'm okay with this, but I think it could use a comment. > + > #include <asm-generic/seccomp.h> > > #endif /* _ASM_X86_SECCOMP_H */ > -- > 2.28.0 > -- Andy Lutomirski AMA Capital Management, LLC ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v4 seccomp 3/5] x86: Enable seccomp architecture tracking 2020-10-09 17:25 ` Andy Lutomirski @ 2020-10-09 18:32 ` YiFei Zhu 2020-10-09 20:59 ` Andy Lutomirski 0 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-10-09 18:32 UTC (permalink / raw) To: Andy Lutomirski Cc: Linux Containers, YiFei Zhu, bpf, LKML, Aleksa Sarai, Andrea Arcangeli, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Fri, Oct 9, 2020 at 12:25 PM Andy Lutomirski <luto@amacapital.net> wrote: > Is the idea that any syscall that's out of range for this (e.g. all of > the x32 syscalls) is unoptimized? I'm okay with this, but I think it > could use a comment. Yes, any syscall number that is out of range is unoptimized. Where do you think I should put a comment? seccomp_cache_check_allow_bitmap above `if (unlikely(syscall_nr < 0 || syscall_nr >= bitmap_size))`, with something like "any syscall number out of range is unoptimized"? YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v4 seccomp 3/5] x86: Enable seccomp architecture tracking 2020-10-09 18:32 ` YiFei Zhu @ 2020-10-09 20:59 ` Andy Lutomirski 0 siblings, 0 replies; 149+ messages in thread From: Andy Lutomirski @ 2020-10-09 20:59 UTC (permalink / raw) To: YiFei Zhu Cc: Linux Containers, YiFei Zhu, bpf, LKML, Aleksa Sarai, Andrea Arcangeli, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Fri, Oct 9, 2020 at 11:32 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > > On Fri, Oct 9, 2020 at 12:25 PM Andy Lutomirski <luto@amacapital.net> wrote: > > Is the idea that any syscall that's out of range for this (e.g. all of > > the x32 syscalls) is unoptimized? I'm okay with this, but I think it > > could use a comment. > > Yes, any syscall number that is out of range is unoptimized. Where do > you think I should put a comment? seccomp_cache_check_allow_bitmap > above `if (unlikely(syscall_nr < 0 || syscall_nr >= bitmap_size))`, > with something like "any syscall number out of range is unoptimized"? > I was imagining a comment near the new macros explaining that this is the range of syscalls that seccomp will optimize, that behavior is still correct (albeit slower) for out of range syscalls, and that x32 is intentionally not optimized. This avoids people like future me reading this code, not remembering the context, and thinking it looks buggy. ^ permalink raw reply [flat|nested] 149+ messages in thread
* [PATCH v4 seccomp 4/5] selftests/seccomp: Compare bitmap vs filter overhead 2020-10-09 17:14 ` [PATCH v4 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu ` (2 preceding siblings ...) 2020-10-09 17:14 ` [PATCH v4 seccomp 3/5] x86: Enable seccomp architecture tracking YiFei Zhu @ 2020-10-09 17:14 ` YiFei Zhu 2020-10-09 17:14 ` [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu 2020-10-11 15:47 ` [PATCH v5 seccomp 0/5]seccomp: Add bitmap cache of constant allow filter results YiFei Zhu 5 siblings, 0 replies; 149+ messages in thread From: YiFei Zhu @ 2020-10-09 17:14 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: Kees Cook <keescook@chromium.org> As part of the seccomp benchmarking, include the expectations with regard to the timing behavior of the constant action bitmaps, and report inconsistencies better. Example output with constant action bitmaps on x86: $ sudo ./seccomp_benchmark 100000000 Current BPF sysctl settings: net.core.bpf_jit_enable = 1 net.core.bpf_jit_harden = 0 Benchmarking 200000000 syscalls... 
129.359381409 - 0.008724424 = 129350656985 (129.4s) getpid native: 646 ns 264.385890006 - 129.360453229 = 135025436777 (135.0s) getpid RET_ALLOW 1 filter (bitmap): 675 ns 399.400511893 - 264.387045901 = 135013465992 (135.0s) getpid RET_ALLOW 2 filters (bitmap): 675 ns 545.872866260 - 399.401718327 = 146471147933 (146.5s) getpid RET_ALLOW 3 filters (full): 732 ns 696.337101319 - 545.874097681 = 150463003638 (150.5s) getpid RET_ALLOW 4 filters (full): 752 ns Estimated total seccomp overhead for 1 bitmapped filter: 29 ns Estimated total seccomp overhead for 2 bitmapped filters: 29 ns Estimated total seccomp overhead for 3 full filters: 86 ns Estimated total seccomp overhead for 4 full filters: 106 ns Estimated seccomp entry overhead: 29 ns Estimated seccomp per-filter overhead (last 2 diff): 20 ns Estimated seccomp per-filter overhead (filters / 4): 19 ns Expectations: native ≤ 1 bitmap (646 ≤ 675): ✔️ native ≤ 1 filter (646 ≤ 732): ✔️ per-filter (last 2 diff) ≈ per-filter (filters / 4) (20 ≈ 19): ✔️ 1 bitmapped ≈ 2 bitmapped (29 ≈ 29): ✔️ entry ≈ 1 bitmapped (29 ≈ 29): ✔️ entry ≈ 2 bitmapped (29 ≈ 29): ✔️ native + entry + (per filter * 4) ≈ 4 filters total (755 ≈ 752): ✔️ Signed-off-by: Kees Cook <keescook@chromium.org> [YiFei: Changed commit message to show stats for this patch series] Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> --- .../selftests/seccomp/seccomp_benchmark.c | 151 +++++++++++++++--- tools/testing/selftests/seccomp/settings | 2 +- 2 files changed, 130 insertions(+), 23 deletions(-) diff --git a/tools/testing/selftests/seccomp/seccomp_benchmark.c b/tools/testing/selftests/seccomp/seccomp_benchmark.c index 91f5a89cadac..fcc806585266 100644 --- a/tools/testing/selftests/seccomp/seccomp_benchmark.c +++ b/tools/testing/selftests/seccomp/seccomp_benchmark.c @@ -4,12 +4,16 @@ */ #define _GNU_SOURCE #include <assert.h> +#include <limits.h> +#include <stdbool.h> +#include <stddef.h> #include <stdio.h> #include <stdlib.h> #include <time.h> #include 
<unistd.h> #include <linux/filter.h> #include <linux/seccomp.h> +#include <sys/param.h> #include <sys/prctl.h> #include <sys/syscall.h> #include <sys/types.h> @@ -70,18 +74,74 @@ unsigned long long calibrate(void) return samples * seconds; } +bool approx(int i_one, int i_two) +{ + double one = i_one, one_bump = one * 0.01; + double two = i_two, two_bump = two * 0.01; + + one_bump = one + MAX(one_bump, 2.0); + two_bump = two + MAX(two_bump, 2.0); + + /* Equal to, or within 1% or 2 digits */ + if (one == two || + (one > two && one <= two_bump) || + (two > one && two <= one_bump)) + return true; + return false; +} + +bool le(int i_one, int i_two) +{ + if (i_one <= i_two) + return true; + return false; +} + +long compare(const char *name_one, const char *name_eval, const char *name_two, + unsigned long long one, bool (*eval)(int, int), unsigned long long two) +{ + bool good; + + printf("\t%s %s %s (%lld %s %lld): ", name_one, name_eval, name_two, + (long long)one, name_eval, (long long)two); + if (one > INT_MAX) { + printf("Miscalculation! Measurement went negative: %lld\n", (long long)one); + return 1; + } + if (two > INT_MAX) { + printf("Miscalculation! Measurement went negative: %lld\n", (long long)two); + return 1; + } + + good = eval(one, two); + printf("%s\n", good ? "✔️" : "❌"); + + return good ? 
0 : 1; +} + int main(int argc, char *argv[]) { + struct sock_filter bitmap_filter[] = { + BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, nr)), + BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW), + }; + struct sock_fprog bitmap_prog = { + .len = (unsigned short)ARRAY_SIZE(bitmap_filter), + .filter = bitmap_filter, + }; struct sock_filter filter[] = { + BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, args[0])), BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW), }; struct sock_fprog prog = { .len = (unsigned short)ARRAY_SIZE(filter), .filter = filter, }; - long ret; - unsigned long long samples; - unsigned long long native, filter1, filter2; + + long ret, bits; + unsigned long long samples, calc; + unsigned long long native, filter1, filter2, bitmap1, bitmap2; + unsigned long long entry, per_filter1, per_filter2; printf("Current BPF sysctl settings:\n"); system("sysctl net.core.bpf_jit_enable"); @@ -101,35 +161,82 @@ int main(int argc, char *argv[]) ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0); assert(ret == 0); - /* One filter */ - ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog); + /* One filter resulting in a bitmap */ + ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bitmap_prog); assert(ret == 0); - filter1 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples; - printf("getpid RET_ALLOW 1 filter: %llu ns\n", filter1); + bitmap1 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples; + printf("getpid RET_ALLOW 1 filter (bitmap): %llu ns\n", bitmap1); + + /* Second filter resulting in a bitmap */ + ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bitmap_prog); + assert(ret == 0); - if (filter1 == native) - printf("No overhead measured!? 
Try running again with more samples.\n"); + bitmap2 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples; + printf("getpid RET_ALLOW 2 filters (bitmap): %llu ns\n", bitmap2); - /* Two filters */ + /* Third filter, can no longer be converted to bitmap */ ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog); assert(ret == 0); - filter2 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples; - printf("getpid RET_ALLOW 2 filters: %llu ns\n", filter2); - - /* Calculations */ - printf("Estimated total seccomp overhead for 1 filter: %llu ns\n", - filter1 - native); + filter1 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples; + printf("getpid RET_ALLOW 3 filters (full): %llu ns\n", filter1); - printf("Estimated total seccomp overhead for 2 filters: %llu ns\n", - filter2 - native); + /* Fourth filter, can not be converted to bitmap because of filter 3 */ + ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bitmap_prog); + assert(ret == 0); - printf("Estimated seccomp per-filter overhead: %llu ns\n", - filter2 - filter1); + filter2 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples; + printf("getpid RET_ALLOW 4 filters (full): %llu ns\n", filter2); + + /* Estimations */ +#define ESTIMATE(fmt, var, what) do { \ + var = (what); \ + printf("Estimated " fmt ": %llu ns\n", var); \ + if (var > INT_MAX) \ + goto more_samples; \ + } while (0) + + ESTIMATE("total seccomp overhead for 1 bitmapped filter", calc, + bitmap1 - native); + ESTIMATE("total seccomp overhead for 2 bitmapped filters", calc, + bitmap2 - native); + ESTIMATE("total seccomp overhead for 3 full filters", calc, + filter1 - native); + ESTIMATE("total seccomp overhead for 4 full filters", calc, + filter2 - native); + ESTIMATE("seccomp entry overhead", entry, + bitmap1 - native - (bitmap2 - bitmap1)); + ESTIMATE("seccomp per-filter overhead (last 2 diff)", per_filter1, + filter2 - filter1); + ESTIMATE("seccomp per-filter overhead (filters / 4)", per_filter2, + (filter2 - native - entry) / 4); + + 
printf("Expectations:\n"); + ret |= compare("native", "≤", "1 bitmap", native, le, bitmap1); + bits = compare("native", "≤", "1 filter", native, le, filter1); + if (bits) + goto more_samples; + + ret |= compare("per-filter (last 2 diff)", "≈", "per-filter (filters / 4)", + per_filter1, approx, per_filter2); + + bits = compare("1 bitmapped", "≈", "2 bitmapped", + bitmap1 - native, approx, bitmap2 - native); + if (bits) { + printf("Skipping constant action bitmap expectations: they appear unsupported.\n"); + goto out; + } - printf("Estimated seccomp entry overhead: %llu ns\n", - filter1 - native - (filter2 - filter1)); + ret |= compare("entry", "≈", "1 bitmapped", entry, approx, bitmap1 - native); + ret |= compare("entry", "≈", "2 bitmapped", entry, approx, bitmap2 - native); + ret |= compare("native + entry + (per filter * 4)", "≈", "4 filters total", + entry + (per_filter1 * 4) + native, approx, filter2); + if (ret == 0) + goto out; +more_samples: + printf("Saw unexpected benchmark result. Try running again with more samples?\n"); +out: return 0; } diff --git a/tools/testing/selftests/seccomp/settings b/tools/testing/selftests/seccomp/settings index ba4d85f74cd6..6091b45d226b 100644 --- a/tools/testing/selftests/seccomp/settings +++ b/tools/testing/selftests/seccomp/settings @@ -1 +1 @@ -timeout=90 +timeout=120 -- 2.28.0 ^ permalink raw reply related [flat|nested] 149+ messages in thread
* [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-10-09 17:14 ` [PATCH v4 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu ` (3 preceding siblings ...) 2020-10-09 17:14 ` [PATCH v4 seccomp 4/5] selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu @ 2020-10-09 17:14 ` YiFei Zhu 2020-10-09 21:45 ` Jann Horn 2020-10-09 23:14 ` Kees Cook 2020-10-11 15:47 ` [PATCH v5 seccomp 0/5]seccomp: Add bitmap cache of constant allow filter results YiFei Zhu 5 siblings, 2 replies; 149+ messages in thread From: YiFei Zhu @ 2020-10-09 17:14 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: YiFei Zhu <yifeifz2@illinois.edu> Currently the kernel does not provide an infrastructure to translate architecture numbers to a human-readable name. Translating syscall numbers to syscall names is possible through FTRACE_SYSCALL infrastructure but it does not provide support for compat syscalls. This will create a file for each PID as /proc/pid/seccomp_cache. The file will be empty when no seccomp filters are loaded, or be in the format of: <arch name> <decimal syscall number> <ALLOW | FILTER> where ALLOW means the cache is guaranteed to allow the syscall, and filter means the cache will pass the syscall to the BPF filter. For the docker default profile on x86_64 it looks like: x86_64 0 ALLOW x86_64 1 ALLOW x86_64 2 ALLOW x86_64 3 ALLOW [...] x86_64 132 ALLOW x86_64 133 ALLOW x86_64 134 FILTER x86_64 135 FILTER x86_64 136 FILTER x86_64 137 ALLOW x86_64 138 ALLOW x86_64 139 FILTER x86_64 140 ALLOW x86_64 141 ALLOW [...] 
This file is guarded by CONFIG_SECCOMP_CACHE_DEBUG with a default of N because I think certain users of seccomp might not want the application to know which syscalls are definitely usable. For the same reason, it is also guarded by CAP_SYS_ADMIN. Suggested-by: Jann Horn <jannh@google.com> Link: https://lore.kernel.org/lkml/CAG48ez3Ofqp4crXGksLmZY6=fGrF_tWyUCg7PBkAetvbbOPeOA@mail.gmail.com/ Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> --- arch/Kconfig | 24 ++++++++++++++ arch/x86/Kconfig | 1 + arch/x86/include/asm/seccomp.h | 3 ++ fs/proc/base.c | 6 ++++ include/linux/seccomp.h | 5 +++ kernel/seccomp.c | 59 ++++++++++++++++++++++++++++++++++ 6 files changed, 98 insertions(+) diff --git a/arch/Kconfig b/arch/Kconfig index 21a3675a7a3a..85239a974f04 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -471,6 +471,15 @@ config HAVE_ARCH_SECCOMP_FILTER results in the system call being skipped immediately. - seccomp syscall wired up +config HAVE_ARCH_SECCOMP_CACHE + bool + help + An arch should select this symbol if it provides all of these things: + - all the requirements for HAVE_ARCH_SECCOMP_FILTER + - SECCOMP_ARCH_NATIVE + - SECCOMP_ARCH_NATIVE_NR + - SECCOMP_ARCH_NATIVE_NAME + config SECCOMP prompt "Enable seccomp to safely execute untrusted bytecode" def_bool y @@ -498,6 +507,21 @@ config SECCOMP_FILTER See Documentation/userspace-api/seccomp_filter.rst for details. +config SECCOMP_CACHE_DEBUG + bool "Show seccomp filter cache status in /proc/pid/seccomp_cache" + depends on SECCOMP + depends on SECCOMP_FILTER + depends on PROC_FS + help + This is enables /proc/pid/seccomp_cache interface to monitor + seccomp cache data. The file format is subject to change. Reading + the file requires CAP_SYS_ADMIN. + + This option is for debugging only. Enabling present the risk that + an adversary may be able to infer the seccomp filter logic. + + If unsure, say N. 
+ config HAVE_ARCH_STACKLEAK bool help diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 1ab22869a765..1a807f89ac77 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -150,6 +150,7 @@ config X86 select HAVE_ARCH_COMPAT_MMAP_BASES if MMU && COMPAT select HAVE_ARCH_PREL32_RELOCATIONS select HAVE_ARCH_SECCOMP_FILTER + select HAVE_ARCH_SECCOMP_CACHE select HAVE_ARCH_THREAD_STRUCT_WHITELIST select HAVE_ARCH_STACKLEAK select HAVE_ARCH_TRACEHOOK diff --git a/arch/x86/include/asm/seccomp.h b/arch/x86/include/asm/seccomp.h index 03365af6165d..cd57c3eabab5 100644 --- a/arch/x86/include/asm/seccomp.h +++ b/arch/x86/include/asm/seccomp.h @@ -19,13 +19,16 @@ #ifdef CONFIG_X86_64 # define SECCOMP_ARCH_NATIVE AUDIT_ARCH_X86_64 # define SECCOMP_ARCH_NATIVE_NR NR_syscalls +# define SECCOMP_ARCH_NATIVE_NAME "x86_64" # ifdef CONFIG_COMPAT # define SECCOMP_ARCH_COMPAT AUDIT_ARCH_I386 # define SECCOMP_ARCH_COMPAT_NR IA32_NR_syscalls +# define SECCOMP_ARCH_COMPAT_NAME "ia32" # endif #else /* !CONFIG_X86_64 */ # define SECCOMP_ARCH_NATIVE AUDIT_ARCH_I386 # define SECCOMP_ARCH_NATIVE_NR NR_syscalls +# define SECCOMP_ARCH_NATIVE_NAME "ia32" #endif #include <asm-generic/seccomp.h> diff --git a/fs/proc/base.c b/fs/proc/base.c index 617db4e0faa0..a4990410ff05 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -3258,6 +3258,9 @@ static const struct pid_entry tgid_base_stuff[] = { #ifdef CONFIG_PROC_PID_ARCH_STATUS ONE("arch_status", S_IRUGO, proc_pid_arch_status), #endif +#ifdef CONFIG_SECCOMP_CACHE_DEBUG + ONE("seccomp_cache", S_IRUSR, proc_pid_seccomp_cache), +#endif }; static int proc_tgid_base_readdir(struct file *file, struct dir_context *ctx) @@ -3587,6 +3590,9 @@ static const struct pid_entry tid_base_stuff[] = { #ifdef CONFIG_PROC_PID_ARCH_STATUS ONE("arch_status", S_IRUGO, proc_pid_arch_status), #endif +#ifdef CONFIG_SECCOMP_CACHE_DEBUG + ONE("seccomp_cache", S_IRUSR, proc_pid_seccomp_cache), +#endif }; static int proc_tid_base_readdir(struct file *file, struct dir_context 
*ctx) diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h index 02aef2844c38..1f028d55142a 100644 --- a/include/linux/seccomp.h +++ b/include/linux/seccomp.h @@ -121,4 +121,9 @@ static inline long seccomp_get_metadata(struct task_struct *task, return -EINVAL; } #endif /* CONFIG_SECCOMP_FILTER && CONFIG_CHECKPOINT_RESTORE */ + +#ifdef CONFIG_SECCOMP_CACHE_DEBUG +int proc_pid_seccomp_cache(struct seq_file *m, struct pid_namespace *ns, + struct pid *pid, struct task_struct *task); +#endif #endif /* _LINUX_SECCOMP_H */ diff --git a/kernel/seccomp.c b/kernel/seccomp.c index 51032b41fe59..a75746d259a5 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -548,6 +548,9 @@ void seccomp_filter_release(struct task_struct *tsk) { struct seccomp_filter *orig = tsk->seccomp.filter; + /* We are effectively holding the siglock by not having any sighand. */ + WARN_ON(tsk->sighand != NULL); + /* Detach task from its filter tree. */ tsk->seccomp.filter = NULL; __seccomp_filter_release(orig); @@ -2308,3 +2311,59 @@ static int __init seccomp_sysctl_init(void) device_initcall(seccomp_sysctl_init) #endif /* CONFIG_SYSCTL */ + +#ifdef CONFIG_SECCOMP_CACHE_DEBUG +/* Currently CONFIG_SECCOMP_CACHE_DEBUG implies SECCOMP_ARCH_NATIVE */ +static void proc_pid_seccomp_cache_arch(struct seq_file *m, const char *name, + const void *bitmap, size_t bitmap_size) +{ + int nr; + + for (nr = 0; nr < bitmap_size; nr++) { + bool cached = test_bit(nr, bitmap); + char *status = cached ? "ALLOW" : "FILTER"; + + seq_printf(m, "%s %d %s\n", name, nr, status); + } +} + +int proc_pid_seccomp_cache(struct seq_file *m, struct pid_namespace *ns, + struct pid *pid, struct task_struct *task) +{ + struct seccomp_filter *f; + unsigned long flags; + + /* + * We don't want some sandboxed process know what their seccomp + * filters consist of. 
+ */ + if (!file_ns_capable(m->file, &init_user_ns, CAP_SYS_ADMIN)) + return -EACCES; + + if (!lock_task_sighand(task, &flags)) + return 0; + + f = READ_ONCE(task->seccomp.filter); + if (!f) { + unlock_task_sighand(task, &flags); + return 0; + } + + /* prevent filter from being freed while we are printing it */ + __get_seccomp_filter(f); + unlock_task_sighand(task, &flags); + + proc_pid_seccomp_cache_arch(m, SECCOMP_ARCH_NATIVE_NAME, + f->cache.allow_native, + SECCOMP_ARCH_NATIVE_NR); + +#ifdef SECCOMP_ARCH_COMPAT + proc_pid_seccomp_cache_arch(m, SECCOMP_ARCH_COMPAT_NAME, + f->cache.allow_compat, + SECCOMP_ARCH_COMPAT_NR); +#endif /* SECCOMP_ARCH_COMPAT */ + + __put_seccomp_filter(f); + return 0; +} +#endif /* CONFIG_SECCOMP_CACHE_DEBUG */ -- 2.28.0 ^ permalink raw reply related [flat|nested] 149+ messages in thread
* Re: [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-10-09 17:14 ` [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu @ 2020-10-09 21:45 ` Jann Horn 2020-10-09 23:14 ` Kees Cook 1 sibling, 0 replies; 149+ messages in thread From: Jann Horn @ 2020-10-09 21:45 UTC (permalink / raw) To: YiFei Zhu Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Fri, Oct 9, 2020 at 7:15 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > Currently the kernel does not provide an infrastructure to translate > architecture numbers to a human-readable name. Translating syscall > numbers to syscall names is possible through FTRACE_SYSCALL > infrastructure but it does not provide support for compat syscalls. > > This will create a file for each PID as /proc/pid/seccomp_cache. > The file will be empty when no seccomp filters are loaded, or be > in the format of: > <arch name> <decimal syscall number> <ALLOW | FILTER> > where ALLOW means the cache is guaranteed to allow the syscall, > and filter means the cache will pass the syscall to the BPF filter. > > For the docker default profile on x86_64 it looks like: > x86_64 0 ALLOW > x86_64 1 ALLOW > x86_64 2 ALLOW > x86_64 3 ALLOW > [...] > x86_64 132 ALLOW > x86_64 133 ALLOW > x86_64 134 FILTER > x86_64 135 FILTER > x86_64 136 FILTER > x86_64 137 ALLOW > x86_64 138 ALLOW > x86_64 139 FILTER > x86_64 140 ALLOW > x86_64 141 ALLOW > [...] > > This file is guarded by CONFIG_SECCOMP_CACHE_DEBUG with a default > of N because I think certain users of seccomp might not want the > application to know which syscalls are definitely usable. For > the same reason, it is also guarded by CAP_SYS_ADMIN. 
> > Suggested-by: Jann Horn <jannh@google.com> > Link: https://lore.kernel.org/lkml/CAG48ez3Ofqp4crXGksLmZY6=fGrF_tWyUCg7PBkAetvbbOPeOA@mail.gmail.com/ > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> [...] > diff --git a/arch/Kconfig b/arch/Kconfig [...] > +config SECCOMP_CACHE_DEBUG > + bool "Show seccomp filter cache status in /proc/pid/seccomp_cache" > + depends on SECCOMP > + depends on SECCOMP_FILTER > + depends on PROC_FS > + help > + This is enables /proc/pid/seccomp_cache interface to monitor nit: s/This is enables/This enables the/ > + seccomp cache data. The file format is subject to change. Reading > + the file requires CAP_SYS_ADMIN. > + > + This option is for debugging only. Enabling present the risk that nit: *presents > + an adversary may be able to infer the seccomp filter logic. [...] > +int proc_pid_seccomp_cache(struct seq_file *m, struct pid_namespace *ns, > + struct pid *pid, struct task_struct *task) > +{ > + struct seccomp_filter *f; > + unsigned long flags; > + > + /* > + * We don't want some sandboxed process know what their seccomp s/know/to know/ > + * filters consist of. > + */ > + if (!file_ns_capable(m->file, &init_user_ns, CAP_SYS_ADMIN)) > + return -EACCES; > + > + if (!lock_task_sighand(task, &flags)) > + return 0; maybe return -ESRCH here so that userspace can distinguish between an exiting process and a process with no filters? > + f = READ_ONCE(task->seccomp.filter); > + if (!f) { > + unlock_task_sighand(task, &flags); > + return 0; > + } [...] ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-10-09 17:14 ` [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu 2020-10-09 21:45 ` Jann Horn @ 2020-10-09 23:14 ` Kees Cook 2020-10-10 13:26 ` YiFei Zhu 1 sibling, 1 reply; 149+ messages in thread From: Kees Cook @ 2020-10-09 23:14 UTC (permalink / raw) To: YiFei Zhu Cc: containers, YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Fri, Oct 09, 2020 at 12:14:33PM -0500, YiFei Zhu wrote: > From: YiFei Zhu <yifeifz2@illinois.edu> > > Currently the kernel does not provide an infrastructure to translate > architecture numbers to a human-readable name. Translating syscall > numbers to syscall names is possible through FTRACE_SYSCALL > infrastructure but it does not provide support for compat syscalls. > > This will create a file for each PID as /proc/pid/seccomp_cache. > The file will be empty when no seccomp filters are loaded, or be > in the format of: > <arch name> <decimal syscall number> <ALLOW | FILTER> > where ALLOW means the cache is guaranteed to allow the syscall, > and filter means the cache will pass the syscall to the BPF filter. > > For the docker default profile on x86_64 it looks like: > x86_64 0 ALLOW > x86_64 1 ALLOW > x86_64 2 ALLOW > x86_64 3 ALLOW > [...] > x86_64 132 ALLOW > x86_64 133 ALLOW > x86_64 134 FILTER > x86_64 135 FILTER > x86_64 136 FILTER > x86_64 137 ALLOW > x86_64 138 ALLOW > x86_64 139 FILTER > x86_64 140 ALLOW > x86_64 141 ALLOW > [...] > > This file is guarded by CONFIG_SECCOMP_CACHE_DEBUG with a default > of N because I think certain users of seccomp might not want the > application to know which syscalls are definitely usable. 
For > the same reason, it is also guarded by CAP_SYS_ADMIN. > > Suggested-by: Jann Horn <jannh@google.com> > Link: https://lore.kernel.org/lkml/CAG48ez3Ofqp4crXGksLmZY6=fGrF_tWyUCg7PBkAetvbbOPeOA@mail.gmail.com/ > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> > --- > arch/Kconfig | 24 ++++++++++++++ > arch/x86/Kconfig | 1 + > arch/x86/include/asm/seccomp.h | 3 ++ > fs/proc/base.c | 6 ++++ > include/linux/seccomp.h | 5 +++ > kernel/seccomp.c | 59 ++++++++++++++++++++++++++++++++++ > 6 files changed, 98 insertions(+) > > diff --git a/arch/Kconfig b/arch/Kconfig > index 21a3675a7a3a..85239a974f04 100644 > --- a/arch/Kconfig > +++ b/arch/Kconfig > @@ -471,6 +471,15 @@ config HAVE_ARCH_SECCOMP_FILTER > results in the system call being skipped immediately. > - seccomp syscall wired up > > +config HAVE_ARCH_SECCOMP_CACHE > + bool > + help > + An arch should select this symbol if it provides all of these things: > + - all the requirements for HAVE_ARCH_SECCOMP_FILTER > + - SECCOMP_ARCH_NATIVE > + - SECCOMP_ARCH_NATIVE_NR > + - SECCOMP_ARCH_NATIVE_NAME > + > [...] > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig > index 1ab22869a765..1a807f89ac77 100644 > --- a/arch/x86/Kconfig > +++ b/arch/x86/Kconfig > @@ -150,6 +150,7 @@ config X86 > select HAVE_ARCH_COMPAT_MMAP_BASES if MMU && COMPAT > select HAVE_ARCH_PREL32_RELOCATIONS > select HAVE_ARCH_SECCOMP_FILTER > + select HAVE_ARCH_SECCOMP_CACHE > select HAVE_ARCH_THREAD_STRUCT_WHITELIST > select HAVE_ARCH_STACKLEAK > select HAVE_ARCH_TRACEHOOK HAVE_ARCH_SECCOMP_CACHE isn't used any more. I think this was left over from before. -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-10-09 23:14 ` Kees Cook @ 2020-10-10 13:26 ` YiFei Zhu 2020-10-12 22:57 ` Kees Cook 0 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-10-10 13:26 UTC (permalink / raw) To: Kees Cook Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Fri, Oct 9, 2020 at 6:14 PM Kees Cook <keescook@chromium.org> wrote: > HAVE_ARCH_SECCOMP_CACHE isn't used any more. I think this was left over > from before. Oh, I was meant to add this to the dependencies of SECCOMP_CACHE_DEBUG. Is this something that would make sense? YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-10-10 13:26 ` YiFei Zhu @ 2020-10-12 22:57 ` Kees Cook 2020-10-13 0:31 ` YiFei Zhu 0 siblings, 1 reply; 149+ messages in thread From: Kees Cook @ 2020-10-12 22:57 UTC (permalink / raw) To: YiFei Zhu Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Sat, Oct 10, 2020 at 08:26:16AM -0500, YiFei Zhu wrote: > On Fri, Oct 9, 2020 at 6:14 PM Kees Cook <keescook@chromium.org> wrote: > > HAVE_ARCH_SECCOMP_CACHE isn't used any more. I think this was left over > > from before. > > Oh, I was meant to add this to the dependencies of > SECCOMP_CACHE_DEBUG. Is this something that would make sense? I think it's fine to just have this "dangle" with a help text update of "if seccomp action caching is supported by the architecture, provide the /proc/$pid ..." -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-10-12 22:57 ` Kees Cook @ 2020-10-13 0:31 ` YiFei Zhu 2020-10-22 20:52 ` YiFei Zhu 0 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-10-13 0:31 UTC (permalink / raw) To: Kees Cook Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Mon, Oct 12, 2020 at 5:57 PM Kees Cook <keescook@chromium.org> wrote: > I think it's fine to just have this "dangle" with a help text update of > "if seccomp action caching is supported by the architecture, provide the > /proc/$pid ..." I think it would be weird if someone sees this help text and wonder... "hmm does my architecture support seccomp action caching" and without a clear pointer to how seccomp action cache works, goes and compiles the kernel with this config option on for the purpose of knowing if their arch supports it... Or, is it a common practice in the kernel to leave dangling configs? YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-10-13 0:31 ` YiFei Zhu @ 2020-10-22 20:52 ` YiFei Zhu 2020-10-22 22:32 ` Kees Cook 0 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-10-22 20:52 UTC (permalink / raw) To: Kees Cook Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Mon, Oct 12, 2020 at 7:31 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > > On Mon, Oct 12, 2020 at 5:57 PM Kees Cook <keescook@chromium.org> wrote: > > I think it's fine to just have this "dangle" with a help text update of > > "if seccomp action caching is supported by the architecture, provide the > > /proc/$pid ..." > > I think it would be weird if someone sees this help text and wonder... > "hmm does my architecture support seccomp action caching" and without > a clear pointer to how seccomp action cache works, goes and compiles > the kernel with this config option on for the purpose of knowing if > their arch supports it... Or, is it a common practice in the kernel to > leave dangling configs? Bump, in case this question was missed. I don't really want to miss the 5.10 merge window... YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-10-22 20:52 ` YiFei Zhu @ 2020-10-22 22:32 ` Kees Cook 2020-10-22 23:40 ` YiFei Zhu 0 siblings, 1 reply; 149+ messages in thread From: Kees Cook @ 2020-10-22 22:32 UTC (permalink / raw) To: YiFei Zhu Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Thu, Oct 22, 2020 at 03:52:20PM -0500, YiFei Zhu wrote: > On Mon, Oct 12, 2020 at 7:31 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > > > > On Mon, Oct 12, 2020 at 5:57 PM Kees Cook <keescook@chromium.org> wrote: > > > I think it's fine to just have this "dangle" with a help text update of > > > "if seccomp action caching is supported by the architecture, provide the > > > /proc/$pid ..." > > > > I think it would be weird if someone sees this help text and wonder... > > "hmm does my architecture support seccomp action caching" and without > > a clear pointer to how seccomp action cache works, goes and compiles > > the kernel with this config option on for the purpose of knowing if > > their arch supports it... Or, is it a common practice in the kernel to > > leave dangling configs? > > Bump, in case this question was missed. I've been going back and forth on this, and I think what I've settled on is I'd like to avoid new CONFIG dependencies just for this feature. Instead, how about we just fill in SECCOMP_NATIVE and SECCOMP_COMPAT for all the HAVE_ARCH_SECCOMP_FILTER architectures, and then the cache reporting can be cleanly tied to CONFIG_SECCOMP_FILTER? It should be relatively simple to extract those details and make SECCOMP_ARCH_{NATIVE,COMPAT}_NAME part of the per-arch enabling patches? > I don't really want to miss the 5.10 merge window... 
Sorry, the 5.10 merge window is already closed for stuff that hasn't already been in -next. Most subsystem maintainers (myself included) don't take new features into their trees between roughly N-rc6 and (N+1)-rc1. My plan is to put this in my -next tree after -rc1 is released (expected to be Oct 25th). I'd still like to get more specific workload performance numbers too. The microbenchmark is nice, but getting things like build times under docker's default seccomp filter, etc would be lovely. I've almost gotten there, but my benchmarks are still really noisy and CPU isolation continues to frustrate me. :) -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-10-22 22:32 ` Kees Cook @ 2020-10-22 23:40 ` YiFei Zhu 2020-10-24 2:51 ` Kees Cook 0 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-10-22 23:40 UTC (permalink / raw) To: Kees Cook Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Thu, Oct 22, 2020 at 5:32 PM Kees Cook <keescook@chromium.org> wrote: > I've been going back and forth on this, and I think what I've settled > on is I'd like to avoid new CONFIG dependencies just for this feature. > Instead, how about we just fill in SECCOMP_NATIVE and SECCOMP_COMPAT > for all the HAVE_ARCH_SECCOMP_FILTER architectures, and then the > cache reporting can be cleanly tied to CONFIG_SECCOMP_FILTER? It > should be relatively simple to extract those details and make > SECCOMP_ARCH_{NATIVE,COMPAT}_NAME part of the per-arch enabling patches? Hmm. So I could enable the cache logic to every architecture (one patch per arch) that does not have the sparse syscall numbers, and then have the proc reporting after the arch patches? I could do that. I don't have test machines to run anything other than x86_64 or ia32, so they will need a closer look by people more familiar with those arches. > I'd still like to get more specific workload performance numbers too. > The microbenchmark is nice, but getting things like build times under > docker's default seccomp filter, etc would be lovely. I've almost gotten > there, but my benchmarks are still really noisy and CPU isolation > continues to frustrate me. :) Ok, let me know if I can help. YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-10-22 23:40 ` YiFei Zhu @ 2020-10-24 2:51 ` Kees Cook 2020-10-30 12:18 ` YiFei Zhu 0 siblings, 1 reply; 149+ messages in thread From: Kees Cook @ 2020-10-24 2:51 UTC (permalink / raw) To: YiFei Zhu Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Thu, Oct 22, 2020 at 06:40:08PM -0500, YiFei Zhu wrote: > On Thu, Oct 22, 2020 at 5:32 PM Kees Cook <keescook@chromium.org> wrote: > > I've been going back and forth on this, and I think what I've settled > > on is I'd like to avoid new CONFIG dependencies just for this feature. > > Instead, how about we just fill in SECCOMP_NATIVE and SECCOMP_COMPAT > > for all the HAVE_ARCH_SECCOMP_FILTER architectures, and then the > > cache reporting can be cleanly tied to CONFIG_SECCOMP_FILTER? It > > should be relatively simple to extract those details and make > > SECCOMP_ARCH_{NATIVE,COMPAT}_NAME part of the per-arch enabling patches? > > Hmm. So I could enable the cache logic to every architecture (one > patch per arch) that does not have the sparse syscall numbers, and > then have the proc reporting after the arch patches? I could do that. > I don't have test machines to run anything other than x86_64 or ia32, > so they will need a closer look by people more familiar with those > arches. Cool, yes please. It looks like MIPS will need to be skipped for now. I would have the debug cache reporting patch then depend on !CONFIG_HAVE_SPARSE_SYSCALL_NR. > > I'd still like to get more specific workload performance numbers too. > > The microbenchmark is nice, but getting things like build times under > > docker's default seccomp filter, etc would be lovely. 
I've almost gotten > > there, but my benchmarks are still really noisy and CPU isolation > > continues to frustrate me. :) > > Ok, let me know if I can help. Do you have a test environment where you can compare the before/after of repeated kernel build times (or some other sufficiently complex/interesting) workload under these conditions: bare metal docker w/ seccomp policy disabled docker w/ default seccomp policy This is what I've been trying to construct, but it's really noisy, so I've been trying to pin CPUs and NUMA memory nodes, but it's not really helping yet. :P -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-10-24 2:51 ` Kees Cook @ 2020-10-30 12:18 ` YiFei Zhu 2020-11-03 13:00 ` YiFei Zhu 0 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-10-30 12:18 UTC (permalink / raw) To: Kees Cook Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Fri, Oct 23, 2020 at 9:51 PM Kees Cook <keescook@chromium.org> wrote:

> Do you have a test environment where you can compare the before/after
> of repeated kernel build times (or some other sufficiently
> complex/interesting) workload under these conditions:
>
> bare metal
> docker w/ seccomp policy disabled
> docker w/ default seccomp policy
>
> This is what I've been trying to construct, but it's really noisy, so
> I've been trying to pin CPUs and NUMA memory nodes, but it's not really
> helping yet. :P

Hi, sorry for the delay. The benchmarks took a while to collect.

I got a bare metal test machine with an Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz, running Ubuntu 18.04. Test kernels are compiled at 57a339117e52 ("selftests/seccomp: Compare bitmap vs filter overhead") and 3650b228f83a ("Linux 5.10-rc1"), built with Ubuntu's 5.3.0-64-generic config, then `make olddefconfig`. "Mitigations off" indicates the kernel was booted with "nospectre_v2 nospectre_v1 no_stf_barrier tsx=off tsx_async_abort=off".

The benchmark was a single-job make of the x86_64 defconfig of 5.9.1, with CPU affinity set to processor #0 only. Raw results are appended below. Each boot is tested by running the build directly and inside docker, with and without seccomp. The commands used are attached below. Each test is 4 trials, with the middle two (non-minimum, non-maximum) wall clock times averaged.
Results summary:

                Mitigations On                Mitigations Off
            With Cache  Without Cache    With Cache  Without Cache
Native        18:17.38       18:13.78      18:16.08       18:15.67
D. no seccomp 18:15.54       18:17.71      18:17.58       18:16.75
D. + seccomp  20:42.47       20:45.04      18:47.67       18:49.01

To be honest, I'm somewhat surprised that it didn't produce as much of a dent in the seccomp overhead in this macro benchmark as I had expected.

Below are the commands used and the outputs from the time command.

Commands used to start the docker containers:

docker run -w /srv/yifeifz2/linux-buildtest \
    --tmpfs /srv/yifeifz2/linux-buildtest:exec --rm -it ubuntu:18.04
docker run -w /srv/yifeifz2/linux-buildtest \
    --tmpfs /srv/yifeifz2/linux-buildtest:exec --rm -it \
    --security-opt seccomp=unconfined ubuntu:18.04

Commands used to install the toolchain inside docker:

apt -y update
apt -y dist-upgrade
apt -y install build-essential wget flex bison time libssl-dev bc libelf-dev

Commands to benchmark on native:

for i in {1..4}; do
    mkdir -p /srv/yifeifz2/linux-buildtest
    mount -t tmpfs tmpfs /srv/yifeifz2/linux-buildtest
    pushd /srv/yifeifz2/linux-buildtest
    wget https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.9.1.tar.xz
    tar xf linux-5.9.1.tar.xz
    cd linux-5.9.1
    make mrproper
    make defconfig
    taskset 0x1 time make -j1 > /dev/null
    popd
    umount /srv/yifeifz2/linux-buildtest
done

Commands to benchmark inside docker:

for i in {1..4}; do
    wget https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.9.1.tar.xz
    tar xf linux-5.9.1.tar.xz
    pushd linux-5.9.1
    make mrproper
    make defconfig
    taskset 0x1 time make -j1 > /dev/null
    popd
    rm -rf linux-5.9.1 linux-5.9.1.tar.xz
done

==== with cache, mitigations on ====
973.52user 113.98system 18:16.51elapsed 99%CPU (0avgtext+0avgdata 239784maxresident)k 0inputs+217152outputs (0major+51937662minor)pagefaults 0swaps
973.74user 115.35system 18:18.41elapsed 99%CPU (0avgtext+0avgdata 239640maxresident)k 0inputs+217152outputs (0major+51933865minor)pagefaults 0swaps
973.31user 114.41system 18:17.37elapsed 99%CPU (0avgtext+0avgdata
239660maxresident)k 72inputs+217152outputs (0major+51936343minor)pagefaults 0swaps 971.76user 116.04system 18:17.39elapsed 99%CPU (0avgtext+0avgdata 239588maxresident)k 0inputs+217152outputs (0major+51936222minor)pagefaults 0swaps 961.44user 121.30system 18:15.30elapsed 98%CPU (0avgtext+0avgdata 239580maxresident)k 0inputs+217152outputs (0major+51555371minor)pagefaults 0swaps 961.86user 119.48system 18:13.96elapsed 98%CPU (0avgtext+0avgdata 239480maxresident)k 0inputs+217152outputs (0major+51552153minor)pagefaults 0swaps 961.68user 121.75system 18:15.78elapsed 98%CPU (0avgtext+0avgdata 239504maxresident)k 0inputs+217152outputs (0major+51559201minor)pagefaults 0swaps 960.80user 122.04system 18:18.99elapsed 98%CPU (0avgtext+0avgdata 239644maxresident)k 0inputs+217152outputs (0major+51557386minor)pagefaults 0swaps 1104.08user 124.48system 20:42.13elapsed 98%CPU (0avgtext+0avgdata 239544maxresident)k 984inputs+217152outputs (21major+51552022minor)pagefaults 0swaps 1101.78user 125.66system 20:40.80elapsed 98%CPU (0avgtext+0avgdata 239692maxresident)k 0inputs+217152outputs (0major+51546446minor)pagefaults 0swaps 1102.98user 126.03system 20:43.09elapsed 98%CPU (0avgtext+0avgdata 239592maxresident)k 0inputs+217152outputs (0major+51551238minor)pagefaults 0swaps 1103.34user 125.32system 20:42.82elapsed 98%CPU (0avgtext+0avgdata 239620maxresident)k 0inputs+217152outputs (0major+51554493minor)pagefaults 0swaps ==== without cache, mitigations on ==== 967.19user 115.77system 18:17.20elapsed 98%CPU (0avgtext+0avgdata 239536maxresident)k 25112inputs+217152outputs (166major+51935958minor)pagefaults 0swaps 969.05user 114.18system 18:12.92elapsed 99%CPU (0avgtext+0avgdata 239544maxresident)k 0inputs+217152outputs (0major+51938961minor)pagefaults 0swaps 968.51user 116.50system 18:14.64elapsed 99%CPU (0avgtext+0avgdata 239716maxresident)k 0inputs+217152outputs (0major+51937686minor)pagefaults 0swaps 968.53user 115.13system 18:10.33elapsed 99%CPU (0avgtext+0avgdata 239628maxresident)k 
0inputs+217152outputs (0major+51938033minor)pagefaults 0swaps 962.85user 121.56system 18:17.73elapsed 98%CPU (0avgtext+0avgdata 239736maxresident)k 0inputs+217152outputs (0major+51549715minor)pagefaults 0swaps 962.51user 121.74system 18:17.42elapsed 98%CPU (0avgtext+0avgdata 239480maxresident)k 0inputs+217152outputs (0major+51558249minor)pagefaults 0swaps 963.37user 121.24system 18:18.59elapsed 98%CPU (0avgtext+0avgdata 239224maxresident)k 0inputs+217152outputs (0major+51551031minor)pagefaults 0swaps 963.71user 120.75system 18:17.70elapsed 98%CPU (0avgtext+0avgdata 239460maxresident)k 0inputs+217152outputs (0major+51555583minor)pagefaults 0swaps 1103.35user 126.49system 20:45.59elapsed 98%CPU (0avgtext+0avgdata 239600maxresident)k 984inputs+217152outputs (21major+51557916minor)pagefaults 0swaps 1103.01user 126.69system 20:45.36elapsed 98%CPU (0avgtext+0avgdata 239708maxresident)k 232inputs+217152outputs (0major+51560311minor)pagefaults 0swaps 1102.97user 127.13system 20:44.73elapsed 98%CPU (0avgtext+0avgdata 239440maxresident)k 0inputs+217152outputs (0major+51552998minor)pagefaults 0swaps 1103.09user 127.01system 20:44.48elapsed 98%CPU (0avgtext+0avgdata 239448maxresident)k 0inputs+217152outputs (0major+51559328minor)pagefaults 0swaps ==== with cache, mitigations off ==== 971.35user 114.45system 18:16.36elapsed 99%CPU (0avgtext+0avgdata 239740maxresident)k 1584inputs+217152outputs (10major+51937572minor)pagefaults 0swaps 971.75user 115.18system 18:16.04elapsed 99%CPU (0avgtext+0avgdata 239648maxresident)k 0inputs+217152outputs (0major+51944016minor)pagefaults 0swaps 972.03user 114.47system 18:16.12elapsed 99%CPU (0avgtext+0avgdata 239368maxresident)k 744inputs+217152outputs (0major+51946745minor)pagefaults 0swaps 970.59user 115.13system 18:15.21elapsed 99%CPU (0avgtext+0avgdata 239736maxresident)k 0inputs+217152outputs (1major+51936971minor)pagefaults 0swaps 964.13user 121.15system 18:17.44elapsed 98%CPU (0avgtext+0avgdata 239496maxresident)k 0inputs+217152outputs 
(0major+51554855minor)pagefaults 0swaps 964.46user 120.73system 18:16.89elapsed 98%CPU (0avgtext+0avgdata 239492maxresident)k 0inputs+217152outputs (0major+51563668minor)pagefaults 0swaps 964.00user 121.71system 18:18.42elapsed 98%CPU (0avgtext+0avgdata 239504maxresident)k 0inputs+217152outputs (0major+51549101minor)pagefaults 0swaps 963.99user 121.46system 18:17.72elapsed 98%CPU (0avgtext+0avgdata 239644maxresident)k 0inputs+217152outputs (0major+51561705minor)pagefaults 0swaps 993.01user 123.83system 18:47.73elapsed 99%CPU (0avgtext+0avgdata 239648maxresident)k 984inputs+217152outputs (21major+51554203minor)pagefaults 0swaps 991.53user 125.49system 18:47.28elapsed 99%CPU (0avgtext+0avgdata 239380maxresident)k 0inputs+217152outputs (0major+51557014minor)pagefaults 0swaps 992.52user 124.53system 18:47.61elapsed 99%CPU (0avgtext+0avgdata 239344maxresident)k 0inputs+217152outputs (0major+51555681minor)pagefaults 0swaps 993.47user 125.01system 18:48.98elapsed 99%CPU (0avgtext+0avgdata 239448maxresident)k 0inputs+217152outputs (0major+51558830minor)pagefaults 0swaps ==== without cache, mitigations off ==== 969.87user 118.18system 18:16.82elapsed 99%CPU (0avgtext+0avgdata 239640maxresident)k 0inputs+217152outputs (0major+51937042minor)pagefaults 0swaps 971.42user 114.62system 18:14.93elapsed 99%CPU (0avgtext+0avgdata 239840maxresident)k 0inputs+217152outputs (0major+51937617minor)pagefaults 0swaps 971.73user 114.40system 18:15.39elapsed 99%CPU (0avgtext+0avgdata 239724maxresident)k 0inputs+217152outputs (0major+51937768minor)pagefaults 0swaps 969.71user 117.13system 18:15.95elapsed 99%CPU (0avgtext+0avgdata 239680maxresident)k 0inputs+217152outputs (0major+51940505minor)pagefaults 0swaps 963.51user 121.32system 18:16.91elapsed 98%CPU (0avgtext+0avgdata 239516maxresident)k 0inputs+217152outputs (0major+51561337minor)pagefaults 0swaps 963.10user 120.75system 18:17.34elapsed 98%CPU (0avgtext+0avgdata 239464maxresident)k 0inputs+217152outputs 
(0major+51547338minor)pagefaults 0swaps 962.27user 122.48system 18:16.59elapsed 98%CPU (0avgtext+0avgdata 239544maxresident)k 0inputs+217152outputs (0major+51552060minor)pagefaults 0swaps 962.83user 120.21system 18:15.37elapsed 98%CPU (0avgtext+0avgdata 239496maxresident)k 0inputs+217152outputs (0major+51553345minor)pagefaults 0swaps 990.69user 125.78system 18:48.93elapsed 98%CPU (0avgtext+0avgdata 239440maxresident)k 984inputs+217152outputs (21major+51558142minor)pagefaults 0swaps 990.76user 126.01system 18:48.88elapsed 98%CPU (0avgtext+0avgdata 239800maxresident)k 0inputs+217152outputs (0major+51558483minor)pagefaults 0swaps 991.06user 125.99system 18:49.30elapsed 98%CPU (0avgtext+0avgdata 239412maxresident)k 0inputs+217152outputs (0major+51556462minor)pagefaults 0swaps 992.33user 124.77system 18:49.09elapsed 98%CPU (0avgtext+0avgdata 239684maxresident)k 0inputs+217152outputs (0major+51549745minor)pagefaults 0swaps YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-10-30 12:18 ` YiFei Zhu @ 2020-11-03 13:00 ` YiFei Zhu 2020-11-04 0:29 ` Kees Cook 0 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-11-03 13:00 UTC (permalink / raw) To: Kees Cook Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Fri, Oct 30, 2020 at 7:18 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > I got a bare metal test machine with Intel(R) Xeon(R) CPU E5-2660 v3 @ > 2.60GHz, running Ubuntu 18.04. Test kernels are compiled at > 57a339117e52 ("selftests/seccomp: Compare bitmap vs filter overhead") > and 3650b228f83a ("Linux 5.10-rc1"), built with Ubuntu's > 5.3.0-64-generic's config, then `make olddefconfig`. "Mitigations off" > indicate the kernel was booted with "nospectre_v2 nospectre_v1 > no_stf_barrier tsx=off tsx_async_abort=off". > > The benchmark was single-job make on x86_64 defconfig of 5.9.1, with > CPU affinity to set only processor #0. Raw results are appended below. > Each boot is tested by running the build directly and inside docker, > with and without seccomp. The commands used are attached below. Each > test is 4 trials, with the middle two (non-minimum, non-maximum) wall > clock time averaged. Results summary: > > Mitigations On Mitigations Off > With Cache Without Cache With Cache Without Cache > Native 18:17.38 18:13.78 18:16.08 18:15.67 > D. no seccomp 18:15.54 18:17.71 18:17.58 18:16.75 > D. + seccomp 20:42.47 20:45.04 18:47.67 18:49.01 > > To be honest, I'm somewhat surprised that it didn't produce as much of > a dent in the seccomp overhead in this macro benchmark as I had > expected. 
My peers pointed out that in my previous benchmark there are still a few mitigations left on, and suggested using "noibrs noibpb nopti nospectre_v2 nospectre_v1 l1tf=off nospec_store_bypass_disable no_stf_barrier mds=off tsx=on tsx_async_abort=off mitigations=off". Results with "Mitigations Off" updated:

                Mitigations On                Mitigations Off
            With Cache  Without Cache    With Cache  Without Cache
Native        18:17.38       18:13.78      17:43.42       17:47.68
D. no seccomp 18:15.54       18:17.71      17:34.59       17:37.54
D. + seccomp  20:42.47       20:45.04      17:35.70       17:37.16

Whether seccomp is on or off seems not to make much of a difference for this benchmark. Enabling the bitmap does seem to decrease the overall compilation time, but it also does so in the runs where seccomp is off, so the speedup is probably from other factors. We are thinking about using more syscall-intensive workloads, such as httpd.

Though, this does make me wonder: where does the 3-minute overhead with seccomp with mitigations come from? Is it data cache misses? If that is the case, can we somehow preload the seccomp bitmap cache maybe? I mean, mitigations only cause around half a minute of slowdown without seccomp, but seccomp somehow amplifies the slowdown by an additional 2.5 minutes, so something must be off here.
This is the raw output for the time commands: ==== with cache, mitigations off ==== 947.02user 108.62system 17:47.65elapsed 98%CPU (0avgtext+0avgdata 239804maxresident)k 25112inputs+217152outputs (166major+51934447minor)pagefaults 0swaps 947.91user 108.20system 17:46.53elapsed 99%CPU (0avgtext+0avgdata 239576maxresident)k 0inputs+217152outputs (0major+51941524minor)pagefaults 0swaps 948.33user 108.70system 17:47.72elapsed 98%CPU (0avgtext+0avgdata 239604maxresident)k 0inputs+217152outputs (0major+51938566minor)pagefaults 0swaps 948.65user 108.81system 17:48.41elapsed 98%CPU (0avgtext+0avgdata 239692maxresident)k 0inputs+217152outputs (0major+51935349minor)pagefaults 0swaps 932.12user 113.68system 17:37.24elapsed 98%CPU (0avgtext+0avgdata 239660maxresident)k 0inputs+217152outputs (0major+51547571minor)pagefaults 0swap 931.69user 114.12system 17:37.84elapsed 98%CPU (0avgtext+0avgdata 239448maxresident)k 0inputs+217152outputs (0major+51539964minor)pagefaults 0swaps 932.25user 113.39system 17:37.75elapsed 98%CPU (0avgtext+0avgdata 239372maxresident)k 0inputs+217152outputs (0major+51538018minor)pagefaults 0swaps 931.09user 114.25system 17:37.34elapsed 98%CPU (0avgtext+0avgdata 239508maxresident)k 0inputs+217152outputs (0major+51537700minor)pagefaults 0swaps 929.96user 113.42system 17:36.23elapsed 98%CPU (0avgtext+0avgdata 239448maxresident)k 984inputs+217152outputs (22major+51544059minor)pagefaults 0swaps 929.73user 115.13system 17:38.09elapsed 98%CPU (0avgtext+0avgdata 239464maxresident)k 0inputs+217152outputs (0major+51540259minor)pagefaults 0swaps 930.13user 112.71system 17:36.17elapsed 98%CPU (0avgtext+0avgdata 239620maxresident)k 0inputs+217152outputs (0major+51540623minor)pagefaults 0swaps 930.57user 113.02system 17:49.70elapsed 97%CPU (0avgtext+0avgdata 239432maxresident)k 0inputs+217152outputs (0major+51537776minor)pagefaults 0swaps ==== without cache, mitigations off ==== 947.59user 108.06system 17:44.56elapsed 99%CPU (0avgtext+0avgdata 239484maxresident)k 
25112inputs+217152outputs (167major+51938723minor)pagefaults 0swaps 947.95user 108.58system 17:43.40elapsed 99%CPU (0avgtext+0avgdata 239580maxresident)k 0inputs+217152outputs (0major+51943434minor)pagefaults 0swaps 948.54user 106.62system 17:42.39elapsed 99%CPU (0avgtext+0avgdata 239608maxresident)k 0inputs+217152outputs (0major+51936408minor)pagefaults 0swaps 947.85user 107.92system 17:43.44elapsed 99%CPU (0avgtext+0avgdata 239656maxresident)k 0inputs+217152outputs (0major+51931633minor)pagefaults 0swaps 931.28user 111.16system 17:33.59elapsed 98%CPU (0avgtext+0avgdata 239440maxresident)k 0inputs+217152outputs (0major+51543540minor)pagefaults 0swaps 930.21user 112.56system 17:34.20elapsed 98%CPU (0avgtext+0avgdata 239400maxresident)k 0inputs+217152outputs (0major+51539699minor)pagefaults 0swaps 930.16user 113.74system 17:35.06elapsed 98%CPU (0avgtext+0avgdata 239344maxresident)k 0inputs+217152outputs (0major+51543072minor)pagefaults 0swaps 930.17user 112.77system 17:34.98elapsed 98%CPU (0avgtext+0avgdata 239176maxresident)k 0inputs+217152outputs (0major+51540777minor)pagefaults 0swaps 931.92user 113.31system 17:36.05elapsed 98%CPU (0avgtext+0avgdata 239520maxresident)k 984inputs+217152outputs (22major+51534636minor)pagefaults 0swaps 931.14user 112.81system 17:35.35elapsed 98%CPU (0avgtext+0avgdata 239524maxresident)k 0inputs+217152outputs (0major+51549007minor)pagefaults 0swaps 930.93user 114.56system 17:37.72elapsed 98%CPU (0avgtext+0avgdata 239360maxresident)k 0inputs+217152outputs (0major+51542191minor)pagefaults 0swaps 932.26user 111.54system 17:35.36elapsed 98%CPU (0avgtext+0avgdata 239572maxresident)k 0inputs+217152outputs (0major+51537921minor)pagefaults 0swaps YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-11-03 13:00 ` YiFei Zhu @ 2020-11-04 0:29 ` Kees Cook 2020-11-04 11:40 ` YiFei Zhu 0 siblings, 1 reply; 149+ messages in thread From: Kees Cook @ 2020-11-04 0:29 UTC (permalink / raw) To: YiFei Zhu Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Tue, Nov 03, 2020 at 07:00:22AM -0600, YiFei Zhu wrote: > My peers pointed out that in my previous benchmark there are still a > few mitigations left on, and suggested to use "noibrs noibpb nopti > nospectre_v2 nospectre_v1 l1tf=off nospec_store_bypass_disable > no_stf_barrier mds=off tsx=on tsx_async_abort=off mitigations=off". > Results with "Mitigations Off" updated: > > Mitigations On Mitigations Off > With Cache Without Cache With Cache Without Cache > Native 18:17.38 18:13.78 17:43.42 17:47.68 > D. no seccomp 18:15.54 18:17.71 17:34.59 17:37.54 > D. + seccomp 20:42.47 20:45.04 17:35.70 17:37.16 > > Whether seccomp is on or off seems not to make much of a difference > for this benchmark. Bitmap being enabled does seem to decrease the > overall compilation time but it also affects where seccomp is off, so > the speedup is probably from other factors. We are thinking about > using more syscall-intensive workloads, such as httpd. Yeah, this is very interesting. That there is anything measurably _slower_ with the cache is surprising. Though with only 4 runs, I wonder if it's still noisy? What happens at 10 runs -- more importantly what is the standard deviation? > Thugh, this does make me wonder, where does the 3-minute overhead with > seccomp with mitigations come from? Is it data cache misses? If that > is the case, can we somehow preload the seccomp bitmap cache maybe? 
I > mean, mitigations only cause around half a minute slowdown without > seccomp but seccomp somehow amplify the slowdown with an additional > 2.5 minutes, so something must be off here. I assume this is from Indirect Branch Prediction Barrier (IBPB) and Single Threaded Indirect Branch Prediction (STIBP) (which get enabled for threads under seccomp by default). Try booting with "spectre_v2_user=prctl" https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/spectre.html#spectre-mitigation-control-command-line -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-11-04 0:29 ` Kees Cook @ 2020-11-04 11:40 ` YiFei Zhu 2020-11-04 18:57 ` Kees Cook 0 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-11-04 11:40 UTC (permalink / raw) To: Kees Cook Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Tue, Nov 3, 2020 at 6:29 PM Kees Cook <keescook@chromium.org> wrote: > Yeah, this is very interesting. That there is anything measurably _slower_ > with the cache is surprising. Though with only 4 runs, I wonder if it's > still noisy? What happens at 10 runs -- more importantly what is the > standard deviation? I could do that. it just takes such a long time. Each run takes about 20 minutes so with 10 runs per environment, 3 environments (native + 2 docker) per boot, and 4 boots (2 bootparam * 2 compile config), it's 27 hours of compilation. I should probably script it at that point. > I assume this is from Indirect Branch Prediction Barrier (IBPB) and > Single Threaded Indirect Branch Prediction (STIBP) (which get enabled > for threads under seccomp by default). > > Try booting with "spectre_v2_user=prctl" Hmm, to make sure, boot with just "spectre_v2_user=prctl" on the command line and test the performance of that? YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-11-04 11:40 ` YiFei Zhu @ 2020-11-04 18:57 ` Kees Cook 0 siblings, 0 replies; 149+ messages in thread From: Kees Cook @ 2020-11-04 18:57 UTC (permalink / raw) To: YiFei Zhu Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Wed, Nov 04, 2020 at 05:40:51AM -0600, YiFei Zhu wrote: > On Tue, Nov 3, 2020 at 6:29 PM Kees Cook <keescook@chromium.org> wrote: > > Yeah, this is very interesting. That there is anything measurably _slower_ > > with the cache is surprising. Though with only 4 runs, I wonder if it's > > still noisy? What happens at 10 runs -- more importantly what is the > > standard deviation? > > I could do that. it just takes such a long time. Each run takes about > 20 minutes so with 10 runs per environment, 3 environments (native + 2 > docker) per boot, and 4 boots (2 bootparam * 2 compile config), it's > 27 hours of compilation. I should probably script it at that point. Yeah, I was facing the same issues. Though perhaps hackbench (with multiple CPUs) would be a better test (and it's much faster): https://lore.kernel.org/lkml/7723ae8d-8333-ba17-6983-a45ec8b11c54@redhat.com/ (I usually run this with a CNT of 20 to get quick results.) > > I assume this is from Indirect Branch Prediction Barrier (IBPB) and > > Single Threaded Indirect Branch Prediction (STIBP) (which get enabled > > for threads under seccomp by default). > > > > Try booting with "spectre_v2_user=prctl" > > Hmm, to make sure, boot with just "spectre_v2_user=prctl" on the > command line and test the performance of that? Right, see if that eliminates the 3 minute jump seen for seccomp. -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread
* [PATCH v5 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results 2020-10-09 17:14 ` [PATCH v4 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu ` (4 preceding siblings ...) 2020-10-09 17:14 ` [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu @ 2020-10-11 15:47 ` YiFei Zhu 2020-10-11 15:47 ` [PATCH v5 seccomp 1/5] seccomp/cache: Lookup syscall allowlist bitmap for fast path YiFei Zhu ` (5 more replies) 5 siblings, 6 replies; 149+ messages in thread From: YiFei Zhu @ 2020-10-11 15:47 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: YiFei Zhu <yifeifz2@illinois.edu>

Alternative: https://lore.kernel.org/lkml/20200923232923.3142503-1-keescook@chromium.org/T/

Major differences from the linked alternative by Kees:
* No x32 special-case handling -- not worth the complexity
* No caching of denylist -- not worth the complexity
* No seccomp arch pinning -- I think this is an independent feature
* The bitmaps are part of the filters rather than the task.

This series adds a bitmap to cache seccomp filter results if the result permits a syscall and is independent of syscall arguments. This visibly decreases seccomp overhead for most common seccomp filters with very little memory footprint.

The overhead of running Seccomp filters has been part of some past discussions [1][2][3]. Oftentimes, the filters have a large number of instructions that check syscall numbers one by one and jump based on that. Some users chain BPF filters, which further enlarges the overhead.
A recent work [6] comprehensively measures the Seccomp overhead and shows that the overhead is non-negligible and has a non-trivial impact on application performance.

We observed that some common filters, such as docker's [4] or systemd's [5], make most decisions based only on the syscall numbers, and as past discussions considered, a bitmap where each bit represents a syscall makes the most sense for these filters.

In order to build this bitmap at filter attach time, each filter is emulated for every syscall (under each possible architecture), and checked for any accesses of struct seccomp_data that are neither the "arch" nor the "nr" (syscall) member. If only "arch" and "nr" are examined, and the program returns allow, then we can be sure that the filter must return allow independent of syscall arguments. When it is concluded that an allow must occur for the given architecture and syscall pair, seccomp will immediately allow the syscall, bypassing further BPF execution.

Ongoing work is to further support arguments with fast hash table lookups. We are investigating the performance of doing so [6], and how to best integrate with the existing seccomp infrastructure.

Some benchmarks are performed with results in patch 5, copied below:

Current BPF sysctl settings:
net.core.bpf_jit_enable = 1
net.core.bpf_jit_harden = 0
Benchmarking 200000000 syscalls...
129.359381409 - 0.008724424 = 129350656985 (129.4s)
getpid native: 646 ns
264.385890006 - 129.360453229 = 135025436777 (135.0s)
getpid RET_ALLOW 1 filter (bitmap): 675 ns
399.400511893 - 264.387045901 = 135013465992 (135.0s)
getpid RET_ALLOW 2 filters (bitmap): 675 ns
545.872866260 - 399.401718327 = 146471147933 (146.5s)
getpid RET_ALLOW 3 filters (full): 732 ns
696.337101319 - 545.874097681 = 150463003638 (150.5s)
getpid RET_ALLOW 4 filters (full): 752 ns
Estimated total seccomp overhead for 1 bitmapped filter: 29 ns
Estimated total seccomp overhead for 2 bitmapped filters: 29 ns
Estimated total seccomp overhead for 3 full filters: 86 ns
Estimated total seccomp overhead for 4 full filters: 106 ns
Estimated seccomp entry overhead: 29 ns
Estimated seccomp per-filter overhead (last 2 diff): 20 ns
Estimated seccomp per-filter overhead (filters / 4): 19 ns
Expectations:
native ≤ 1 bitmap (646 ≤ 675): ✔️
native ≤ 1 filter (646 ≤ 732): ✔️
per-filter (last 2 diff) ≈ per-filter (filters / 4) (20 ≈ 19): ✔️
1 bitmapped ≈ 2 bitmapped (29 ≈ 29): ✔️
entry ≈ 1 bitmapped (29 ≈ 29): ✔️
entry ≈ 2 bitmapped (29 ≈ 29): ✔️
native + entry + (per filter * 4) ≈ 4 filters total (755 ≈ 752): ✔️

v4 -> v5:
* Typo and wording fixes
* Skip arch number test when there is only one arch
* Fixed prog instruction number check.
* Added comment about the behavior of x32.
* /proc/pid/seccomp_cache returns -ESRCH for an exiting process.
* Fixed /proc/pid/seccomp_cache depending on the architecture.
* Fixed struct seq_file visibility reported by kernel test robot.

v3 -> v4:
* Reordered patches
* Naming changes
* Fixed racing in /proc/pid/seccomp_cache against filter being released from task, using Jann's suggestion of sighand spinlock.
* Cache no longer configurable.
* Copied some description from cover letter to commit messages.
* Used Kees's logic to clear bits from the bitmap, rather than set bits.

v2 -> v3:
* Added array_index_nospec guards
* No more syscall_arches[] array and expecting on loop unrolling.
  Arches are configured with per-arch seccomp.h.
* Moved filter emulation to attach time (from prepare time).
* Further simplified emulator, based on Kees's code.
* Guard /proc/pid/seccomp_cache with CAP_SYS_ADMIN.

v1 -> v2:
* Corrected one outdated function documentation.

RFC -> v1:
* Config made on by default across all arches that could support it.
* Added an arch numbers array, emulate the filter for each arch number, and have a per-arch bitmap.
* Massively simplified the emulator so it would only support the common instructions in Kees's list.
* Fixed inheriting bitmap across filters (filter->prev is always NULL during prepare).
* Stole the selftest from Kees.
* Added /proc/pid/seccomp_cache by Jann's suggestion.

Patch 1 implements the test_bit against the bitmaps.
Patch 2 implements the emulator that finds if a filter must return allow.
Patch 3 adds the arch macros for x86.
Patch 4 updates the selftest to better show the new semantics.
Patch 5 implements /proc/pid/seccomp_cache.

[1] https://lore.kernel.org/linux-security-module/c22a6c3cefc2412cad00ae14c1371711@huawei.com/T/
[2] https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/T/
[3] https://github.com/seccomp/libseccomp/issues/116
[4] https://github.com/moby/moby/blob/ae0ef82b90356ac613f329a8ef5ee42ca923417d/profiles/seccomp/default.json
[5] https://github.com/systemd/systemd/blob/6743a1caf4037f03dc51a1277855018e4ab61957/src/shared/seccomp-util.c#L270
[6] Draco: Architectural and Operating System Support for System Call Security https://tianyin.github.io/pub/draco.pdf, MICRO-53, Oct.
2020

Kees Cook (2):
  x86: Enable seccomp architecture tracking
  selftests/seccomp: Compare bitmap vs filter overhead

YiFei Zhu (3):
  seccomp/cache: Lookup syscall allowlist bitmap for fast path
  seccomp/cache: Add "emulator" to check if filter is constant allow
  seccomp/cache: Report cache data through /proc/pid/seccomp_cache

 arch/Kconfig                                  |  24 ++
 arch/x86/Kconfig                              |   1 +
 arch/x86/include/asm/seccomp.h                |  20 ++
 fs/proc/base.c                                |   6 +
 include/linux/seccomp.h                       |   7 +
 kernel/seccomp.c                              | 292 +++++++++++++++++-
 .../selftests/seccomp/seccomp_benchmark.c     | 151 +++++++--
 tools/testing/selftests/seccomp/settings      |   2 +-
 8 files changed, 479 insertions(+), 24 deletions(-)

--
2.28.0

^ permalink raw reply	[flat|nested] 149+ messages in thread
* [PATCH v5 seccomp 1/5] seccomp/cache: Lookup syscall allowlist bitmap for fast path
  2020-10-11 15:47 ` [PATCH v5 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu
@ 2020-10-11 15:47 ` YiFei Zhu
  2020-10-12  6:42   ` Jann Horn
  2020-10-11 15:47 ` [PATCH v5 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow YiFei Zhu
  ` (4 subsequent siblings)
  5 siblings, 1 reply; 149+ messages in thread
From: YiFei Zhu @ 2020-10-11 15:47 UTC (permalink / raw)
To: containers
Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli,
	Andy Lutomirski, David Laight, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn,
	Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

From: YiFei Zhu <yifeifz2@illinois.edu>

The overhead of running Seccomp filters has been part of some past
discussions [1][2][3]. Oftentimes, the filters have a large number
of instructions that check syscall numbers one by one and jump based
on that. Some users chain BPF filters which further enlarge the
overhead. A recent work [6] comprehensively measures the Seccomp
overhead and shows that the overhead is non-negligible and has a
non-trivial impact on application performance.

We observed that some common filters, such as docker's [4] or
systemd's [5], make most decisions based only on the syscall
numbers, and as past discussions considered, a bitmap where each bit
represents a syscall makes most sense for these filters.

The fast (common) path for seccomp should be that the filter permits
the syscall to pass through, and failing seccomp is expected to be
an exceptional case; it is not expected for userspace to call a
denylisted syscall over and over.

When it can be concluded that an allow must occur for the given
architecture and syscall pair (this determination is introduced in
the next commit), seccomp will immediately allow the syscall,
bypassing further BPF execution.
Each architecture number has its own bitmap. The architecture
number in seccomp_data is checked against the defined architecture
number constant before proceeding to test the bit against the
bitmap with the syscall number as the index of the bit in the
bitmap, and if the bit is set, seccomp returns allow. The bitmaps
are all clear in this patch and will be initialized in the next
commit.

When only one architecture exists, the check against the architecture
number is skipped, as suggested by Kees Cook [7].

[1] https://lore.kernel.org/linux-security-module/c22a6c3cefc2412cad00ae14c1371711@huawei.com/T/
[2] https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/T/
[3] https://github.com/seccomp/libseccomp/issues/116
[4] https://github.com/moby/moby/blob/ae0ef82b90356ac613f329a8ef5ee42ca923417d/profiles/seccomp/default.json
[5] https://github.com/systemd/systemd/blob/6743a1caf4037f03dc51a1277855018e4ab61957/src/shared/seccomp-util.c#L270
[6] Draco: Architectural and Operating System Support for System Call Security
    https://tianyin.github.io/pub/draco.pdf, MICRO-53, Oct. 2020
[7] https://lore.kernel.org/bpf/202010091614.8BB0EB64@keescook/

Co-developed-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu>
Signed-off-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu>
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
---
 kernel/seccomp.c | 77 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 77 insertions(+)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index ae6b40cc39f4..d67a8b61f2bf 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -143,6 +143,34 @@ struct notification {
 	struct list_head notifications;
 };

+#ifdef SECCOMP_ARCH_NATIVE
+/**
+ * struct action_cache - per-filter cache of seccomp actions per
+ * arch/syscall pair
+ *
+ * @allow_native: A bitmap where each bit represents whether the
+ *		  filter will always allow the syscall, for the
+ *		  native architecture.
+ * @allow_compat: A bitmap where each bit represents whether the
+ *		  filter will always allow the syscall, for the
+ *		  compat architecture.
+ */
+struct action_cache {
+	DECLARE_BITMAP(allow_native, SECCOMP_ARCH_NATIVE_NR);
+#ifdef SECCOMP_ARCH_COMPAT
+	DECLARE_BITMAP(allow_compat, SECCOMP_ARCH_COMPAT_NR);
+#endif
+};
+#else
+struct action_cache { };
+
+static inline bool seccomp_cache_check_allow(const struct seccomp_filter *sfilter,
+					     const struct seccomp_data *sd)
+{
+	return false;
+}
+#endif /* SECCOMP_ARCH_NATIVE */
+
 /**
  * struct seccomp_filter - container for seccomp BPF programs
  *
@@ -298,6 +326,52 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen)
 	return 0;
 }

+#ifdef SECCOMP_ARCH_NATIVE
+static inline bool seccomp_cache_check_allow_bitmap(const void *bitmap,
+						    size_t bitmap_size,
+						    int syscall_nr)
+{
+	if (unlikely(syscall_nr < 0 || syscall_nr >= bitmap_size))
+		return false;
+	syscall_nr = array_index_nospec(syscall_nr, bitmap_size);
+
+	return test_bit(syscall_nr, bitmap);
+}
+
+/**
+ * seccomp_cache_check_allow - lookup seccomp cache
+ * @sfilter: The seccomp filter
+ * @sd: The seccomp data to lookup the cache with
+ *
+ * Returns true if the seccomp_data is cached and allowed.
+ */
+static inline bool seccomp_cache_check_allow(const struct seccomp_filter *sfilter,
+					     const struct seccomp_data *sd)
+{
+	int syscall_nr = sd->nr;
+	const struct action_cache *cache = &sfilter->cache;
+
+#ifndef SECCOMP_ARCH_COMPAT
+	/* A native-only architecture doesn't need to check sd->arch. */
+	return seccomp_cache_check_allow_bitmap(cache->allow_native,
+						SECCOMP_ARCH_NATIVE_NR,
+						syscall_nr);
+#else
+	if (likely(sd->arch == SECCOMP_ARCH_NATIVE))
+		return seccomp_cache_check_allow_bitmap(cache->allow_native,
+							SECCOMP_ARCH_NATIVE_NR,
+							syscall_nr);
+	if (likely(sd->arch == SECCOMP_ARCH_COMPAT))
+		return seccomp_cache_check_allow_bitmap(cache->allow_compat,
+							SECCOMP_ARCH_COMPAT_NR,
+							syscall_nr);
+#endif /* SECCOMP_ARCH_COMPAT */
+
+	WARN_ON_ONCE(true);
+	return false;
+}
+#endif /* SECCOMP_ARCH_NATIVE */
+
 /**
  * seccomp_run_filters - evaluates all seccomp filters against @sd
  * @sd: optional seccomp data to be passed to filters
@@ -320,6 +394,9 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd,
 	if (WARN_ON(f == NULL))
 		return SECCOMP_RET_KILL_PROCESS;

+	if (seccomp_cache_check_allow(f, sd))
+		return SECCOMP_RET_ALLOW;
+
 	/*
 	 * All filters in the list are evaluated and the lowest BPF return
 	 * value always takes priority (ignoring the DATA).
--
2.28.0

^ permalink raw reply related	[flat|nested] 149+ messages in thread
* Re: [PATCH v5 seccomp 1/5] seccomp/cache: Lookup syscall allowlist bitmap for fast path 2020-10-11 15:47 ` [PATCH v5 seccomp 1/5] seccomp/cache: Lookup syscall allowlist bitmap for fast path YiFei Zhu @ 2020-10-12 6:42 ` Jann Horn 0 siblings, 0 replies; 149+ messages in thread From: Jann Horn @ 2020-10-12 6:42 UTC (permalink / raw) To: YiFei Zhu Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Sun, Oct 11, 2020 at 5:48 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > The overhead of running Seccomp filters has been part of some past > discussions [1][2][3]. Oftentimes, the filters have a large number > of instructions that check syscall numbers one by one and jump based > on that. Some users chain BPF filters which further enlarge the > overhead. A recent work [6] comprehensively measures the Seccomp > overhead and shows that the overhead is non-negligible and has a > non-trivial impact on application performance. > > We observed some common filters, such as docker's [4] or > systemd's [5], will make most decisions based only on the syscall > numbers, and as past discussions considered, a bitmap where each bit > represents a syscall makes most sense for these filters. > > The fast (common) path for seccomp should be that the filter permits > the syscall to pass through, and failing seccomp is expected to be > an exceptional case; it is not expected for userspace to call a > denylisted syscall over and over. > > When it can be concluded that an allow must occur for the given > architecture and syscall pair (this determination is introduced in > the next commit), seccomp will immediately allow the syscall, > bypassing further BPF execution. > > Each architecture number has its own bitmap. 
The architecture > number in seccomp_data is checked against the defined architecture > number constant before proceeding to test the bit against the > bitmap with the syscall number as the index of the bit in the > bitmap, and if the bit is set, seccomp returns allow. The bitmaps > are all clear in this patch and will be initialized in the next > commit. > > When only one architecture exists, the check against architecture > number is skipped, suggested by Kees Cook [7]. > > [1] https://lore.kernel.org/linux-security-module/c22a6c3cefc2412cad00ae14c1371711@huawei.com/T/ > [2] https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/T/ > [3] https://github.com/seccomp/libseccomp/issues/116 > [4] https://github.com/moby/moby/blob/ae0ef82b90356ac613f329a8ef5ee42ca923417d/profiles/seccomp/default.json > [5] https://github.com/systemd/systemd/blob/6743a1caf4037f03dc51a1277855018e4ab61957/src/shared/seccomp-util.c#L270 > [6] Draco: Architectural and Operating System Support for System Call Security > https://tianyin.github.io/pub/draco.pdf, MICRO-53, Oct. 2020 > [7] https://lore.kernel.org/bpf/202010091614.8BB0EB64@keescook/ > > Co-developed-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu> > Signed-off-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu> > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> Reviewed-by: Jann Horn <jannh@google.com> ^ permalink raw reply [flat|nested] 149+ messages in thread
* [PATCH v5 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow
  2020-10-11 15:47 ` [PATCH v5 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu
  2020-10-11 15:47 ` [PATCH v5 seccomp 1/5] seccomp/cache: Lookup syscall allowlist bitmap for fast path YiFei Zhu
@ 2020-10-11 15:47 ` YiFei Zhu
  2020-10-12  6:46   ` Jann Horn
  2020-10-11 15:47 ` [PATCH v5 seccomp 3/5] x86: Enable seccomp architecture tracking YiFei Zhu
  ` (3 subsequent siblings)
  5 siblings, 1 reply; 149+ messages in thread
From: YiFei Zhu @ 2020-10-11 15:47 UTC (permalink / raw)
To: containers
Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli,
	Andy Lutomirski, David Laight, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn,
	Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

From: YiFei Zhu <yifeifz2@illinois.edu>

SECCOMP_CACHE will only operate on syscalls that do not access
any syscall arguments or instruction pointer. To facilitate
this we need a static analyser to know whether a filter will
return allow regardless of syscall arguments for a given
architecture number / syscall number pair. This is implemented
here with a pseudo-emulator, and stored in a per-filter bitmap.

In order to build this bitmap at filter attach time, each filter is
emulated for every syscall (under each possible architecture), and
checked for any accesses of struct seccomp_data that are not the "arch"
nor "nr" (syscall) members. If only "arch" and "nr" are examined, and
the program returns allow, then we can be sure that the filter must
return allow independent from syscall arguments.

Nearly all seccomp filters are built from these cBPF instructions:

BPF_LD | BPF_W | BPF_ABS
BPF_JMP | BPF_JEQ | BPF_K
BPF_JMP | BPF_JGE | BPF_K
BPF_JMP | BPF_JGT | BPF_K
BPF_JMP | BPF_JSET | BPF_K
BPF_JMP | BPF_JA
BPF_RET | BPF_K
BPF_ALU | BPF_AND | BPF_K

Each of these instructions is emulated.
Any weirdness or loading
from a syscall argument will cause the emulator to bail.

The emulation is also halted if it reaches a return. In that case,
if it returns a SECCOMP_RET_ALLOW, the syscall is marked as good.

Emulator structure and comments are from Kees [1] and Jann [2].

Emulation is done at attach time. If a filter depends on more
filters, and if the dependee does not guarantee to allow the
syscall, then we skip the emulation of this syscall.

[1] https://lore.kernel.org/lkml/20200923232923.3142503-5-keescook@chromium.org/
[2] https://lore.kernel.org/lkml/CAG48ez1p=dR_2ikKq=xVxkoGg0fYpTBpkhJSv1w-6BG=76PAvw@mail.gmail.com/

Suggested-by: Jann Horn <jannh@google.com>
Co-developed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
---
 kernel/seccomp.c | 156 ++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 155 insertions(+), 1 deletion(-)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index d67a8b61f2bf..236e7b367d4e 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -169,6 +169,10 @@ static inline bool seccomp_cache_check_allow(const struct seccomp_filter *sfilte
 {
 	return false;
 }
+
+static inline void seccomp_cache_prepare(struct seccomp_filter *sfilter)
+{
+}
 #endif /* SECCOMP_ARCH_NATIVE */

 /**
@@ -187,6 +191,7 @@ static inline bool seccomp_cache_check_allow(const struct seccomp_filter *sfilte
  *	    this filter after reaching 0. The @users count is always smaller
  *	    or equal to @refs. Hence, reaching 0 for @users does not mean
  *	    the filter can be freed.
+ * @cache: cache of arch/syscall mappings to actions
  * @log: true if all actions except for SECCOMP_RET_ALLOW should be logged
  * @prev: points to a previously installed, or inherited, filter
  * @prog: the BPF program to evaluate
@@ -208,6 +213,7 @@ struct seccomp_filter {
 	refcount_t refs;
 	refcount_t users;
 	bool log;
+	struct action_cache cache;
 	struct seccomp_filter *prev;
 	struct bpf_prog *prog;
 	struct notification *notif;
@@ -621,7 +627,12 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
 {
 	struct seccomp_filter *sfilter;
 	int ret;
-	const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE);
+	const bool save_orig =
+#if defined(CONFIG_CHECKPOINT_RESTORE) || defined(SECCOMP_ARCH_NATIVE)
+		true;
+#else
+		false;
+#endif

 	if (fprog->len == 0 || fprog->len > BPF_MAXINSNS)
 		return ERR_PTR(-EINVAL);
@@ -687,6 +698,148 @@ seccomp_prepare_user_filter(const char __user *user_filter)
 	return filter;
 }

+#ifdef SECCOMP_ARCH_NATIVE
+/**
+ * seccomp_is_const_allow - check if filter is constant allow with given data
+ * @fprog: The BPF programs
+ * @sd: The seccomp data to check against, only syscall number and arch
+ *	number are considered constant.
+ */
+static bool seccomp_is_const_allow(struct sock_fprog_kern *fprog,
+				   struct seccomp_data *sd)
+{
+	unsigned int reg_value = 0;
+	unsigned int pc;
+	bool op_res;
+
+	if (WARN_ON_ONCE(!fprog))
+		return false;
+
+	for (pc = 0; pc < fprog->len; pc++) {
+		struct sock_filter *insn = &fprog->filter[pc];
+		u16 code = insn->code;
+		u32 k = insn->k;
+
+		switch (code) {
+		case BPF_LD | BPF_W | BPF_ABS:
+			switch (k) {
+			case offsetof(struct seccomp_data, nr):
+				reg_value = sd->nr;
+				break;
+			case offsetof(struct seccomp_data, arch):
+				reg_value = sd->arch;
+				break;
+			default:
+				/* can't optimize (non-constant value load) */
+				return false;
+			}
+			break;
+		case BPF_RET | BPF_K:
+			/* reached return with constant values only, check allow */
+			return k == SECCOMP_RET_ALLOW;
+		case BPF_JMP | BPF_JA:
+			pc += insn->k;
+			break;
+		case BPF_JMP | BPF_JEQ | BPF_K:
+		case BPF_JMP | BPF_JGE | BPF_K:
+		case BPF_JMP | BPF_JGT | BPF_K:
+		case BPF_JMP | BPF_JSET | BPF_K:
+			switch (BPF_OP(code)) {
+			case BPF_JEQ:
+				op_res = reg_value == k;
+				break;
+			case BPF_JGE:
+				op_res = reg_value >= k;
+				break;
+			case BPF_JGT:
+				op_res = reg_value > k;
+				break;
+			case BPF_JSET:
+				op_res = !!(reg_value & k);
+				break;
+			default:
+				/* can't optimize (unknown jump) */
+				return false;
+			}
+
+			pc += op_res ? insn->jt : insn->jf;
+			break;
+		case BPF_ALU | BPF_AND | BPF_K:
+			reg_value &= k;
+			break;
+		default:
+			/* can't optimize (unknown insn) */
+			return false;
+		}
+	}
+
+	/* ran off the end of the filter?! */
+	WARN_ON(1);
+	return false;
+}
+
+static void seccomp_cache_prepare_bitmap(struct seccomp_filter *sfilter,
+					 void *bitmap, const void *bitmap_prev,
+					 size_t bitmap_size, int arch)
+{
+	struct sock_fprog_kern *fprog = sfilter->prog->orig_prog;
+	struct seccomp_data sd;
+	int nr;
+
+	if (bitmap_prev) {
+		/* The new filter must be as restrictive as the last. */
+		bitmap_copy(bitmap, bitmap_prev, bitmap_size);
+	} else {
+		/* Before any filters, all syscalls are always allowed. */
+		bitmap_fill(bitmap, bitmap_size);
+	}
+
+	for (nr = 0; nr < bitmap_size; nr++) {
+		/* No bitmap change: not a cacheable action. */
+		if (!test_bit(nr, bitmap))
+			continue;
+
+		sd.nr = nr;
+		sd.arch = arch;
+
+		/* No bitmap change: continue to always allow. */
+		if (seccomp_is_const_allow(fprog, &sd))
+			continue;
+
+		/*
+		 * Not a cacheable action: always run filters.
+		 * atomic clear_bit() not needed, filter not visible yet.
+		 */
+		__clear_bit(nr, bitmap);
+	}
+}
+
+/**
+ * seccomp_cache_prepare - emulate the filter to find cachable syscalls
+ * @sfilter: The seccomp filter
+ *
+ * Returns 0 if successful or -errno if error occurred.
+ */
+static void seccomp_cache_prepare(struct seccomp_filter *sfilter)
+{
+	struct action_cache *cache = &sfilter->cache;
+	const struct action_cache *cache_prev =
+		sfilter->prev ? &sfilter->prev->cache : NULL;
+
+	seccomp_cache_prepare_bitmap(sfilter, cache->allow_native,
+				     cache_prev ? cache_prev->allow_native : NULL,
+				     SECCOMP_ARCH_NATIVE_NR,
+				     SECCOMP_ARCH_NATIVE);
+
+#ifdef SECCOMP_ARCH_COMPAT
+	seccomp_cache_prepare_bitmap(sfilter, cache->allow_compat,
+				     cache_prev ? cache_prev->allow_compat : NULL,
+				     SECCOMP_ARCH_COMPAT_NR,
+				     SECCOMP_ARCH_COMPAT);
+#endif /* SECCOMP_ARCH_COMPAT */
+}
+#endif /* SECCOMP_ARCH_NATIVE */
+
 /**
  * seccomp_attach_filter: validate and attach filter
  * @flags:  flags to change filter behavior
@@ -736,6 +889,7 @@ static long seccomp_attach_filter(unsigned int flags,
 	 * task reference.
 	 */
 	filter->prev = current->seccomp.filter;
+	seccomp_cache_prepare(filter);
 	current->seccomp.filter = filter;
 	atomic_inc(&current->seccomp.filter_count);

--
2.28.0

^ permalink raw reply related	[flat|nested] 149+ messages in thread
* Re: [PATCH v5 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow 2020-10-11 15:47 ` [PATCH v5 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow YiFei Zhu @ 2020-10-12 6:46 ` Jann Horn 0 siblings, 0 replies; 149+ messages in thread From: Jann Horn @ 2020-10-12 6:46 UTC (permalink / raw) To: YiFei Zhu Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Sun, Oct 11, 2020 at 5:48 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > SECCOMP_CACHE will only operate on syscalls that do not access > any syscall arguments or instruction pointer. To facilitate > this we need a static analyser to know whether a filter will > return allow regardless of syscall arguments for a given > architecture number / syscall number pair. This is implemented > here with a pseudo-emulator, and stored in a per-filter bitmap. > > In order to build this bitmap at filter attach time, each filter is > emulated for every syscall (under each possible architecture), and > checked for any accesses of struct seccomp_data that are not the "arch" > nor "nr" (syscall) members. If only "arch" and "nr" are examined, and > the program returns allow, then we can be sure that the filter must > return allow independent from syscall arguments. > > Nearly all seccomp filters are built from these cBPF instructions: > > BPF_LD | BPF_W | BPF_ABS > BPF_JMP | BPF_JEQ | BPF_K > BPF_JMP | BPF_JGE | BPF_K > BPF_JMP | BPF_JGT | BPF_K > BPF_JMP | BPF_JSET | BPF_K > BPF_JMP | BPF_JA > BPF_RET | BPF_K > BPF_ALU | BPF_AND | BPF_K > > Each of these instructions are emulated. Any weirdness or loading > from a syscall argument will cause the emulator to bail. > > The emulation is also halted if it reaches a return. 
In that case, > if it returns an SECCOMP_RET_ALLOW, the syscall is marked as good. > > Emulator structure and comments are from Kees [1] and Jann [2]. > > Emulation is done at attach time. If a filter depends on more > filters, and if the dependee does not guarantee to allow the > syscall, then we skip the emulation of this syscall. > > [1] https://lore.kernel.org/lkml/20200923232923.3142503-5-keescook@chromium.org/ > [2] https://lore.kernel.org/lkml/CAG48ez1p=dR_2ikKq=xVxkoGg0fYpTBpkhJSv1w-6BG=76PAvw@mail.gmail.com/ > > Suggested-by: Jann Horn <jannh@google.com> > Co-developed-by: Kees Cook <keescook@chromium.org> > Signed-off-by: Kees Cook <keescook@chromium.org> > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> Reviewed-by: Jann Horn <jannh@google.com> ^ permalink raw reply [flat|nested] 149+ messages in thread
* [PATCH v5 seccomp 3/5] x86: Enable seccomp architecture tracking
  2020-10-11 15:47 ` [PATCH v5 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu
  2020-10-11 15:47 ` [PATCH v5 seccomp 1/5] seccomp/cache: Lookup syscall allowlist bitmap for fast path YiFei Zhu
  2020-10-11 15:47 ` [PATCH v5 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow YiFei Zhu
@ 2020-10-11 15:47 ` YiFei Zhu
  2020-10-11 15:47 ` [PATCH v5 seccomp 4/5] selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu
  ` (2 subsequent siblings)
  5 siblings, 0 replies; 149+ messages in thread
From: YiFei Zhu @ 2020-10-11 15:47 UTC (permalink / raw)
To: containers
Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli,
	Andy Lutomirski, David Laight, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn,
	Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

From: Kees Cook <keescook@chromium.org>

Provide seccomp internals with the details to calculate which syscall
table the running kernel is expecting to deal with. This allows for
efficient architecture pinning and paves the way for constant-action
bitmaps.
Signed-off-by: Kees Cook <keescook@chromium.org>
Co-developed-by: YiFei Zhu <yifeifz2@illinois.edu>
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
---
 arch/x86/include/asm/seccomp.h | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/arch/x86/include/asm/seccomp.h b/arch/x86/include/asm/seccomp.h
index 2bd1338de236..b17d037c72ce 100644
--- a/arch/x86/include/asm/seccomp.h
+++ b/arch/x86/include/asm/seccomp.h
@@ -16,6 +16,23 @@
 #define __NR_seccomp_sigreturn_32 __NR_ia32_sigreturn
 #endif

+#ifdef CONFIG_X86_64
+# define SECCOMP_ARCH_NATIVE		AUDIT_ARCH_X86_64
+# define SECCOMP_ARCH_NATIVE_NR		NR_syscalls
+# ifdef CONFIG_COMPAT
+# define SECCOMP_ARCH_COMPAT		AUDIT_ARCH_I386
+# define SECCOMP_ARCH_COMPAT_NR		IA32_NR_syscalls
+# endif
+/*
+ * x32 will have __X32_SYSCALL_BIT set in syscall number. We don't support
+ * caching them and they are treated as out of range syscalls, which will
+ * always pass through the BPF filter.
+ */
+#else /* !CONFIG_X86_64 */
+# define SECCOMP_ARCH_NATIVE		AUDIT_ARCH_I386
+# define SECCOMP_ARCH_NATIVE_NR		NR_syscalls
+#endif
+
 #include <asm-generic/seccomp.h>

 #endif /* _ASM_X86_SECCOMP_H */
--
2.28.0

^ permalink raw reply related	[flat|nested] 149+ messages in thread
* [PATCH v5 seccomp 4/5] selftests/seccomp: Compare bitmap vs filter overhead
  2020-10-11 15:47 ` [PATCH v5 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu
  ` (2 preceding siblings ...)
  2020-10-11 15:47 ` [PATCH v5 seccomp 3/5] x86: Enable seccomp architecture tracking YiFei Zhu
@ 2020-10-11 15:47 ` YiFei Zhu
  2020-10-11 15:47 ` [PATCH v5 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu
  2020-10-27 19:14 ` [PATCH v5 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results Kees Cook
  5 siblings, 0 replies; 149+ messages in thread
From: YiFei Zhu @ 2020-10-11 15:47 UTC (permalink / raw)
To: containers
Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli,
	Andy Lutomirski, David Laight, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn,
	Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

From: Kees Cook <keescook@chromium.org>

As part of the seccomp benchmarking, include the expectations with
regard to the timing behavior of the constant action bitmaps, and
report inconsistencies better.

Example output with constant action bitmaps on x86:

$ sudo ./seccomp_benchmark 100000000
Current BPF sysctl settings:
net.core.bpf_jit_enable = 1
net.core.bpf_jit_harden = 0
Benchmarking 200000000 syscalls...
129.359381409 - 0.008724424 = 129350656985 (129.4s)
getpid native: 646 ns
264.385890006 - 129.360453229 = 135025436777 (135.0s)
getpid RET_ALLOW 1 filter (bitmap): 675 ns
399.400511893 - 264.387045901 = 135013465992 (135.0s)
getpid RET_ALLOW 2 filters (bitmap): 675 ns
545.872866260 - 399.401718327 = 146471147933 (146.5s)
getpid RET_ALLOW 3 filters (full): 732 ns
696.337101319 - 545.874097681 = 150463003638 (150.5s)
getpid RET_ALLOW 4 filters (full): 752 ns
Estimated total seccomp overhead for 1 bitmapped filter: 29 ns
Estimated total seccomp overhead for 2 bitmapped filters: 29 ns
Estimated total seccomp overhead for 3 full filters: 86 ns
Estimated total seccomp overhead for 4 full filters: 106 ns
Estimated seccomp entry overhead: 29 ns
Estimated seccomp per-filter overhead (last 2 diff): 20 ns
Estimated seccomp per-filter overhead (filters / 4): 19 ns
Expectations:
	native ≤ 1 bitmap (646 ≤ 675): ✔️
	native ≤ 1 filter (646 ≤ 732): ✔️
	per-filter (last 2 diff) ≈ per-filter (filters / 4) (20 ≈ 19): ✔️
	1 bitmapped ≈ 2 bitmapped (29 ≈ 29): ✔️
	entry ≈ 1 bitmapped (29 ≈ 29): ✔️
	entry ≈ 2 bitmapped (29 ≈ 29): ✔️
	native + entry + (per filter * 4) ≈ 4 filters total (755 ≈ 752): ✔️

Signed-off-by: Kees Cook <keescook@chromium.org>
[YiFei: Changed commit message to show stats for this patch series]
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
---
 .../selftests/seccomp/seccomp_benchmark.c     | 151 +++++++++++++++---
 tools/testing/selftests/seccomp/settings      |   2 +-
 2 files changed, 130 insertions(+), 23 deletions(-)

diff --git a/tools/testing/selftests/seccomp/seccomp_benchmark.c b/tools/testing/selftests/seccomp/seccomp_benchmark.c
index 91f5a89cadac..fcc806585266 100644
--- a/tools/testing/selftests/seccomp/seccomp_benchmark.c
+++ b/tools/testing/selftests/seccomp/seccomp_benchmark.c
@@ -4,12 +4,16 @@
  */
 #define _GNU_SOURCE
 #include <assert.h>
+#include <limits.h>
+#include <stdbool.h>
+#include <stddef.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <time.h>
 #include
<unistd.h> #include <linux/filter.h> #include <linux/seccomp.h> +#include <sys/param.h> #include <sys/prctl.h> #include <sys/syscall.h> #include <sys/types.h> @@ -70,18 +74,74 @@ unsigned long long calibrate(void) return samples * seconds; } +bool approx(int i_one, int i_two) +{ + double one = i_one, one_bump = one * 0.01; + double two = i_two, two_bump = two * 0.01; + + one_bump = one + MAX(one_bump, 2.0); + two_bump = two + MAX(two_bump, 2.0); + + /* Equal to, or within 1% or 2 digits */ + if (one == two || + (one > two && one <= two_bump) || + (two > one && two <= one_bump)) + return true; + return false; +} + +bool le(int i_one, int i_two) +{ + if (i_one <= i_two) + return true; + return false; +} + +long compare(const char *name_one, const char *name_eval, const char *name_two, + unsigned long long one, bool (*eval)(int, int), unsigned long long two) +{ + bool good; + + printf("\t%s %s %s (%lld %s %lld): ", name_one, name_eval, name_two, + (long long)one, name_eval, (long long)two); + if (one > INT_MAX) { + printf("Miscalculation! Measurement went negative: %lld\n", (long long)one); + return 1; + } + if (two > INT_MAX) { + printf("Miscalculation! Measurement went negative: %lld\n", (long long)two); + return 1; + } + + good = eval(one, two); + printf("%s\n", good ? "✔️" : "❌"); + + return good ? 
0 : 1; +} + int main(int argc, char *argv[]) { + struct sock_filter bitmap_filter[] = { + BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, nr)), + BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW), + }; + struct sock_fprog bitmap_prog = { + .len = (unsigned short)ARRAY_SIZE(bitmap_filter), + .filter = bitmap_filter, + }; struct sock_filter filter[] = { + BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, args[0])), BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW), }; struct sock_fprog prog = { .len = (unsigned short)ARRAY_SIZE(filter), .filter = filter, }; - long ret; - unsigned long long samples; - unsigned long long native, filter1, filter2; + + long ret, bits; + unsigned long long samples, calc; + unsigned long long native, filter1, filter2, bitmap1, bitmap2; + unsigned long long entry, per_filter1, per_filter2; printf("Current BPF sysctl settings:\n"); system("sysctl net.core.bpf_jit_enable"); @@ -101,35 +161,82 @@ int main(int argc, char *argv[]) ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0); assert(ret == 0); - /* One filter */ - ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog); + /* One filter resulting in a bitmap */ + ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bitmap_prog); assert(ret == 0); - filter1 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples; - printf("getpid RET_ALLOW 1 filter: %llu ns\n", filter1); + bitmap1 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples; + printf("getpid RET_ALLOW 1 filter (bitmap): %llu ns\n", bitmap1); + + /* Second filter resulting in a bitmap */ + ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bitmap_prog); + assert(ret == 0); - if (filter1 == native) - printf("No overhead measured!? 
Try running again with more samples.\n"); + bitmap2 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples; + printf("getpid RET_ALLOW 2 filters (bitmap): %llu ns\n", bitmap2); - /* Two filters */ + /* Third filter, can no longer be converted to bitmap */ ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog); assert(ret == 0); - filter2 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples; - printf("getpid RET_ALLOW 2 filters: %llu ns\n", filter2); - - /* Calculations */ - printf("Estimated total seccomp overhead for 1 filter: %llu ns\n", - filter1 - native); + filter1 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples; + printf("getpid RET_ALLOW 3 filters (full): %llu ns\n", filter1); - printf("Estimated total seccomp overhead for 2 filters: %llu ns\n", - filter2 - native); + /* Fourth filter, can not be converted to bitmap because of filter 3 */ + ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bitmap_prog); + assert(ret == 0); - printf("Estimated seccomp per-filter overhead: %llu ns\n", - filter2 - filter1); + filter2 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples; + printf("getpid RET_ALLOW 4 filters (full): %llu ns\n", filter2); + + /* Estimations */ +#define ESTIMATE(fmt, var, what) do { \ + var = (what); \ + printf("Estimated " fmt ": %llu ns\n", var); \ + if (var > INT_MAX) \ + goto more_samples; \ + } while (0) + + ESTIMATE("total seccomp overhead for 1 bitmapped filter", calc, + bitmap1 - native); + ESTIMATE("total seccomp overhead for 2 bitmapped filters", calc, + bitmap2 - native); + ESTIMATE("total seccomp overhead for 3 full filters", calc, + filter1 - native); + ESTIMATE("total seccomp overhead for 4 full filters", calc, + filter2 - native); + ESTIMATE("seccomp entry overhead", entry, + bitmap1 - native - (bitmap2 - bitmap1)); + ESTIMATE("seccomp per-filter overhead (last 2 diff)", per_filter1, + filter2 - filter1); + ESTIMATE("seccomp per-filter overhead (filters / 4)", per_filter2, + (filter2 - native - entry) / 4); + + 
printf("Expectations:\n"); + ret |= compare("native", "≤", "1 bitmap", native, le, bitmap1); + bits = compare("native", "≤", "1 filter", native, le, filter1); + if (bits) + goto more_samples; + + ret |= compare("per-filter (last 2 diff)", "≈", "per-filter (filters / 4)", + per_filter1, approx, per_filter2); + + bits = compare("1 bitmapped", "≈", "2 bitmapped", + bitmap1 - native, approx, bitmap2 - native); + if (bits) { + printf("Skipping constant action bitmap expectations: they appear unsupported.\n"); + goto out; + } - printf("Estimated seccomp entry overhead: %llu ns\n", - filter1 - native - (filter2 - filter1)); + ret |= compare("entry", "≈", "1 bitmapped", entry, approx, bitmap1 - native); + ret |= compare("entry", "≈", "2 bitmapped", entry, approx, bitmap2 - native); + ret |= compare("native + entry + (per filter * 4)", "≈", "4 filters total", + entry + (per_filter1 * 4) + native, approx, filter2); + if (ret == 0) + goto out; +more_samples: + printf("Saw unexpected benchmark result. Try running again with more samples?\n"); +out: return 0; } diff --git a/tools/testing/selftests/seccomp/settings b/tools/testing/selftests/seccomp/settings index ba4d85f74cd6..6091b45d226b 100644 --- a/tools/testing/selftests/seccomp/settings +++ b/tools/testing/selftests/seccomp/settings @@ -1 +1 @@ -timeout=90 +timeout=120 -- 2.28.0 ^ permalink raw reply related [flat|nested] 149+ messages in thread
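The selftest above distinguishes "bitmap" filters, which inspect only seccomp_data->nr, from filters that read args[0] and therefore can never be bitmap-cached. The kernel makes that determination with the BPF "emulator" added earlier in the series; as a much-simplified userspace sketch of the underlying criterion, one can conservatively reject any filter containing an absolute load that reaches past the arch field into instruction_pointer or args[]. Note this is only an approximation of the idea for illustration (the in-kernel emulator actually emulates the program per syscall number), and the function name below is mine, not the kernel's:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/*
 * Conservative approximation of the cacheability check: a filter can
 * only be bitmap-cached if its result cannot depend on the syscall
 * arguments or the instruction pointer.  The in-kernel "emulator" is
 * smarter (it runs the program once per syscall number), but any
 * filter that passes this scan is certainly argument-independent.
 */
static bool filter_may_be_cacheable(const struct sock_filter *insns,
				    size_t count)
{
	size_t i;

	for (i = 0; i < count; i++) {
		__u16 code = insns[i].code;

		if (BPF_CLASS(code) != BPF_LD && BPF_CLASS(code) != BPF_LDX)
			continue;
		if (BPF_MODE(code) != BPF_ABS)
			continue;
		/*
		 * Loads of ->nr and ->arch are fine; anything at or past
		 * ->instruction_pointer (which includes args[]) makes the
		 * result argument-dependent.
		 */
		if (insns[i].k >= offsetof(struct seccomp_data, instruction_pointer))
			return false;
	}
	return true;
}
```

Fed the two programs from the selftest, this scan accepts bitmap_filter (loads only ->nr) and rejects the args[0]-reading filter, matching which of them the kernel can bitmap.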
* [PATCH v5 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-10-11 15:47 ` [PATCH v5 seccomp 0/5]seccomp: Add bitmap cache of constant allow filter results YiFei Zhu ` (3 preceding siblings ...) 2020-10-11 15:47 ` [PATCH v5 seccomp 4/5] selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu @ 2020-10-11 15:47 ` YiFei Zhu 2020-10-12 6:49 ` Jann Horn 2020-12-17 12:14 ` Geert Uytterhoeven 2020-10-27 19:14 ` [PATCH v5 seccomp 0/5]seccomp: Add bitmap cache of constant allow filter results Kees Cook 5 siblings, 2 replies; 149+ messages in thread From: YiFei Zhu @ 2020-10-11 15:47 UTC (permalink / raw) To: containers Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry From: YiFei Zhu <yifeifz2@illinois.edu> Currently the kernel does not provide an infrastructure to translate architecture numbers to a human-readable name. Translating syscall numbers to syscall names is possible through FTRACE_SYSCALL infrastructure but it does not provide support for compat syscalls. This will create a file for each PID as /proc/pid/seccomp_cache. The file will be empty when no seccomp filters are loaded, or be in the format of: <arch name> <decimal syscall number> <ALLOW | FILTER> where ALLOW means the cache is guaranteed to allow the syscall, and filter means the cache will pass the syscall to the BPF filter. For the docker default profile on x86_64 it looks like: x86_64 0 ALLOW x86_64 1 ALLOW x86_64 2 ALLOW x86_64 3 ALLOW [...] x86_64 132 ALLOW x86_64 133 ALLOW x86_64 134 FILTER x86_64 135 FILTER x86_64 136 FILTER x86_64 137 ALLOW x86_64 138 ALLOW x86_64 139 FILTER x86_64 140 ALLOW x86_64 141 ALLOW [...] 
This file is guarded by CONFIG_SECCOMP_CACHE_DEBUG with a default of N because I think certain users of seccomp might not want the application to know which syscalls are definitely usable. For the same reason, it is also guarded by CAP_SYS_ADMIN. Suggested-by: Jann Horn <jannh@google.com> Link: https://lore.kernel.org/lkml/CAG48ez3Ofqp4crXGksLmZY6=fGrF_tWyUCg7PBkAetvbbOPeOA@mail.gmail.com/ Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> --- arch/Kconfig | 24 ++++++++++++++ arch/x86/Kconfig | 1 + arch/x86/include/asm/seccomp.h | 3 ++ fs/proc/base.c | 6 ++++ include/linux/seccomp.h | 7 ++++ kernel/seccomp.c | 59 ++++++++++++++++++++++++++++++++++ 6 files changed, 100 insertions(+) diff --git a/arch/Kconfig b/arch/Kconfig index 21a3675a7a3a..6157c3ce0662 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -471,6 +471,15 @@ config HAVE_ARCH_SECCOMP_FILTER results in the system call being skipped immediately. - seccomp syscall wired up +config HAVE_ARCH_SECCOMP_CACHE + bool + help + An arch should select this symbol if it provides all of these things: + - all the requirements for HAVE_ARCH_SECCOMP_FILTER + - SECCOMP_ARCH_NATIVE + - SECCOMP_ARCH_NATIVE_NR + - SECCOMP_ARCH_NATIVE_NAME + config SECCOMP prompt "Enable seccomp to safely execute untrusted bytecode" def_bool y @@ -498,6 +507,21 @@ config SECCOMP_FILTER See Documentation/userspace-api/seccomp_filter.rst for details. +config SECCOMP_CACHE_DEBUG + bool "Show seccomp filter cache status in /proc/pid/seccomp_cache" + depends on SECCOMP + depends on SECCOMP_FILTER && HAVE_ARCH_SECCOMP_CACHE + depends on PROC_FS + help + This enables the /proc/pid/seccomp_cache interface to monitor + seccomp cache data. The file format is subject to change. Reading + the file requires CAP_SYS_ADMIN. + + This option is for debugging only. Enabling presents the risk that + an adversary may be able to infer the seccomp filter logic. + + If unsure, say N. 
+ config HAVE_ARCH_STACKLEAK bool help diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 1ab22869a765..1a807f89ac77 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -150,6 +150,7 @@ config X86 select HAVE_ARCH_COMPAT_MMAP_BASES if MMU && COMPAT select HAVE_ARCH_PREL32_RELOCATIONS select HAVE_ARCH_SECCOMP_FILTER + select HAVE_ARCH_SECCOMP_CACHE select HAVE_ARCH_THREAD_STRUCT_WHITELIST select HAVE_ARCH_STACKLEAK select HAVE_ARCH_TRACEHOOK diff --git a/arch/x86/include/asm/seccomp.h b/arch/x86/include/asm/seccomp.h index b17d037c72ce..fef16e398161 100644 --- a/arch/x86/include/asm/seccomp.h +++ b/arch/x86/include/asm/seccomp.h @@ -19,9 +19,11 @@ #ifdef CONFIG_X86_64 # define SECCOMP_ARCH_NATIVE AUDIT_ARCH_X86_64 # define SECCOMP_ARCH_NATIVE_NR NR_syscalls +# define SECCOMP_ARCH_NATIVE_NAME "x86_64" # ifdef CONFIG_COMPAT # define SECCOMP_ARCH_COMPAT AUDIT_ARCH_I386 # define SECCOMP_ARCH_COMPAT_NR IA32_NR_syscalls +# define SECCOMP_ARCH_COMPAT_NAME "ia32" # endif /* * x32 will have __X32_SYSCALL_BIT set in syscall number. 
We don't support @@ -31,6 +33,7 @@ #else /* !CONFIG_X86_64 */ # define SECCOMP_ARCH_NATIVE AUDIT_ARCH_I386 # define SECCOMP_ARCH_NATIVE_NR NR_syscalls +# define SECCOMP_ARCH_NATIVE_NAME "ia32" #endif #include <asm-generic/seccomp.h> diff --git a/fs/proc/base.c b/fs/proc/base.c index 617db4e0faa0..a4990410ff05 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -3258,6 +3258,9 @@ static const struct pid_entry tgid_base_stuff[] = { #ifdef CONFIG_PROC_PID_ARCH_STATUS ONE("arch_status", S_IRUGO, proc_pid_arch_status), #endif +#ifdef CONFIG_SECCOMP_CACHE_DEBUG + ONE("seccomp_cache", S_IRUSR, proc_pid_seccomp_cache), +#endif }; static int proc_tgid_base_readdir(struct file *file, struct dir_context *ctx) @@ -3587,6 +3590,9 @@ static const struct pid_entry tid_base_stuff[] = { #ifdef CONFIG_PROC_PID_ARCH_STATUS ONE("arch_status", S_IRUGO, proc_pid_arch_status), #endif +#ifdef CONFIG_SECCOMP_CACHE_DEBUG + ONE("seccomp_cache", S_IRUSR, proc_pid_seccomp_cache), +#endif }; static int proc_tid_base_readdir(struct file *file, struct dir_context *ctx) diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h index 02aef2844c38..76963ec4641a 100644 --- a/include/linux/seccomp.h +++ b/include/linux/seccomp.h @@ -121,4 +121,11 @@ static inline long seccomp_get_metadata(struct task_struct *task, return -EINVAL; } #endif /* CONFIG_SECCOMP_FILTER && CONFIG_CHECKPOINT_RESTORE */ + +#ifdef CONFIG_SECCOMP_CACHE_DEBUG +struct seq_file; + +int proc_pid_seccomp_cache(struct seq_file *m, struct pid_namespace *ns, + struct pid *pid, struct task_struct *task); +#endif #endif /* _LINUX_SECCOMP_H */ diff --git a/kernel/seccomp.c b/kernel/seccomp.c index 236e7b367d4e..1df2fac281da 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -553,6 +553,9 @@ void seccomp_filter_release(struct task_struct *tsk) { struct seccomp_filter *orig = tsk->seccomp.filter; + /* We are effectively holding the siglock by not having any sighand. 
*/ + WARN_ON(tsk->sighand != NULL); + /* Detach task from its filter tree. */ tsk->seccomp.filter = NULL; __seccomp_filter_release(orig); @@ -2311,3 +2314,59 @@ static int __init seccomp_sysctl_init(void) device_initcall(seccomp_sysctl_init) #endif /* CONFIG_SYSCTL */ + +#ifdef CONFIG_SECCOMP_CACHE_DEBUG +/* Currently CONFIG_SECCOMP_CACHE_DEBUG implies SECCOMP_ARCH_NATIVE */ +static void proc_pid_seccomp_cache_arch(struct seq_file *m, const char *name, + const void *bitmap, size_t bitmap_size) +{ + int nr; + + for (nr = 0; nr < bitmap_size; nr++) { + bool cached = test_bit(nr, bitmap); + char *status = cached ? "ALLOW" : "FILTER"; + + seq_printf(m, "%s %d %s\n", name, nr, status); + } +} + +int proc_pid_seccomp_cache(struct seq_file *m, struct pid_namespace *ns, + struct pid *pid, struct task_struct *task) +{ + struct seccomp_filter *f; + unsigned long flags; + + /* + * We don't want some sandboxed process to know what their seccomp + * filters consist of. + */ + if (!file_ns_capable(m->file, &init_user_ns, CAP_SYS_ADMIN)) + return -EACCES; + + if (!lock_task_sighand(task, &flags)) + return -ESRCH; + + f = READ_ONCE(task->seccomp.filter); + if (!f) { + unlock_task_sighand(task, &flags); + return 0; + } + + /* prevent filter from being freed while we are printing it */ + __get_seccomp_filter(f); + unlock_task_sighand(task, &flags); + + proc_pid_seccomp_cache_arch(m, SECCOMP_ARCH_NATIVE_NAME, + f->cache.allow_native, + SECCOMP_ARCH_NATIVE_NR); + +#ifdef SECCOMP_ARCH_COMPAT + proc_pid_seccomp_cache_arch(m, SECCOMP_ARCH_COMPAT_NAME, + f->cache.allow_compat, + SECCOMP_ARCH_COMPAT_NR); +#endif /* SECCOMP_ARCH_COMPAT */ + + __put_seccomp_filter(f); + return 0; +} +#endif /* CONFIG_SECCOMP_CACHE_DEBUG */ -- 2.28.0 ^ permalink raw reply related [flat|nested] 149+ messages in thread
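The /proc/pid/seccomp_cache format the patch emits ("%s %d %s\n" per syscall, i.e. <arch name> <decimal syscall number> <ALLOW | FILTER>) is easy to consume from userspace. Below is a hedged sketch of a line parser; the function name is mine, and bear in mind that the file only exists with CONFIG_SECCOMP_CACHE_DEBUG=y, requires CAP_SYS_ADMIN to read, and the commit message explicitly says the format is subject to change:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse one line of /proc/<pid>/seccomp_cache, which the kernel
 * seq_printf()s as "%s %d %s\n" per syscall.  Returns true on
 * success; *allowed is set when the cached verdict is ALLOW
 * (FILTER means the BPF program still runs for that syscall).
 */
static bool parse_seccomp_cache_line(const char *line, char arch[32],
				     int *nr, bool *allowed)
{
	char verdict[16];

	if (sscanf(line, "%31s %d %15s", arch, nr, verdict) != 3)
		return false;
	if (strcmp(verdict, "ALLOW") == 0)
		*allowed = true;
	else if (strcmp(verdict, "FILTER") == 0)
		*allowed = false;
	else
		return false;
	return true;
}
```

A consumer would fgets() each line of the proc file and feed it to this parser, e.g. to report which syscalls of a sandboxed task still pay the full filter-execution cost.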
* Re: [PATCH v5 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-10-11 15:47 ` [PATCH v5 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu @ 2020-10-12 6:49 ` Jann Horn 2020-12-17 12:14 ` Geert Uytterhoeven 1 sibling, 0 replies; 149+ messages in thread From: Jann Horn @ 2020-10-12 6:49 UTC (permalink / raw) To: YiFei Zhu Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Sun, Oct 11, 2020 at 5:48 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > Currently the kernel does not provide an infrastructure to translate > architecture numbers to a human-readable name. Translating syscall > numbers to syscall names is possible through FTRACE_SYSCALL > infrastructure but it does not provide support for compat syscalls. > > This will create a file for each PID as /proc/pid/seccomp_cache. > The file will be empty when no seccomp filters are loaded, or be > in the format of: > <arch name> <decimal syscall number> <ALLOW | FILTER> > where ALLOW means the cache is guaranteed to allow the syscall, > and filter means the cache will pass the syscall to the BPF filter. > > For the docker default profile on x86_64 it looks like: > x86_64 0 ALLOW > x86_64 1 ALLOW > x86_64 2 ALLOW > x86_64 3 ALLOW > [...] > x86_64 132 ALLOW > x86_64 133 ALLOW > x86_64 134 FILTER > x86_64 135 FILTER > x86_64 136 FILTER > x86_64 137 ALLOW > x86_64 138 ALLOW > x86_64 139 FILTER > x86_64 140 ALLOW > x86_64 141 ALLOW > [...] > > This file is guarded by CONFIG_SECCOMP_CACHE_DEBUG with a default > of N because I think certain users of seccomp might not want the > application to know which syscalls are definitely usable. For > the same reason, it is also guarded by CAP_SYS_ADMIN. 
> > Suggested-by: Jann Horn <jannh@google.com> > Link: https://lore.kernel.org/lkml/CAG48ez3Ofqp4crXGksLmZY6=fGrF_tWyUCg7PBkAetvbbOPeOA@mail.gmail.com/ > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> Reviewed-by: Jann Horn <jannh@google.com> ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v5 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-10-11 15:47 ` [PATCH v5 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu 2020-10-12 6:49 ` Jann Horn @ 2020-12-17 12:14 ` Geert Uytterhoeven 2020-12-17 18:34 ` YiFei Zhu 1 sibling, 1 reply; 149+ messages in thread From: Geert Uytterhoeven @ 2020-12-17 12:14 UTC (permalink / raw) To: YiFei Zhu Cc: containers, YiFei Zhu, bpf, Linux Kernel Mailing List, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry Hi Yifei, On Sun, Oct 11, 2020 at 8:08 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > From: YiFei Zhu <yifeifz2@illinois.edu> > > Currently the kernel does not provide an infrastructure to translate > architecture numbers to a human-readable name. Translating syscall > numbers to syscall names is possible through FTRACE_SYSCALL > infrastructure but it does not provide support for compat syscalls. > > This will create a file for each PID as /proc/pid/seccomp_cache. > The file will be empty when no seccomp filters are loaded, or be > in the format of: > <arch name> <decimal syscall number> <ALLOW | FILTER> > where ALLOW means the cache is guaranteed to allow the syscall, > and filter means the cache will pass the syscall to the BPF filter. > > For the docker default profile on x86_64 it looks like: > x86_64 0 ALLOW > x86_64 1 ALLOW > x86_64 2 ALLOW > x86_64 3 ALLOW > [...] > x86_64 132 ALLOW > x86_64 133 ALLOW > x86_64 134 FILTER > x86_64 135 FILTER > x86_64 136 FILTER > x86_64 137 ALLOW > x86_64 138 ALLOW > x86_64 139 FILTER > x86_64 140 ALLOW > x86_64 141 ALLOW > [...] 
> > This file is guarded by CONFIG_SECCOMP_CACHE_DEBUG with a default > of N because I think certain users of seccomp might not want the > application to know which syscalls are definitely usable. For > the same reason, it is also guarded by CAP_SYS_ADMIN. > > Suggested-by: Jann Horn <jannh@google.com> > Link: https://lore.kernel.org/lkml/CAG48ez3Ofqp4crXGksLmZY6=fGrF_tWyUCg7PBkAetvbbOPeOA@mail.gmail.com/ > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> > @@ -2311,3 +2314,59 @@ static int __init seccomp_sysctl_init(void) > device_initcall(seccomp_sysctl_init) > > #endif /* CONFIG_SYSCTL */ > + > +#ifdef CONFIG_SECCOMP_CACHE_DEBUG > +/* Currently CONFIG_SECCOMP_CACHE_DEBUG implies SECCOMP_ARCH_NATIVE */ Should there be a dependency on SECCOMP_ARCH_NATIVE? Should all architectures that implement seccomp have this? E.g. mips does select HAVE_ARCH_SECCOMP_FILTER, but doesn't have SECCOMP_ARCH_NATIVE? (noticed with preliminary out-of-tree seccomp implementation for m68k, which doesn't have SECCOMP_ARCH_NATIVE > +static void proc_pid_seccomp_cache_arch(struct seq_file *m, const char *name, > + const void *bitmap, size_t bitmap_size) > +{ > + int nr; > + > + for (nr = 0; nr < bitmap_size; nr++) { > + bool cached = test_bit(nr, bitmap); > + char *status = cached ? "ALLOW" : "FILTER"; > + > + seq_printf(m, "%s %d %s\n", name, nr, status); > + } > +} > + > +int proc_pid_seccomp_cache(struct seq_file *m, struct pid_namespace *ns, > + struct pid *pid, struct task_struct *task) > +{ > + struct seccomp_filter *f; > + unsigned long flags; > + > + /* > + * We don't want some sandboxed process to know what their seccomp > + * filters consist of. 
> + */ > + if (!file_ns_capable(m->file, &init_user_ns, CAP_SYS_ADMIN)) > + return -EACCES; > + > + if (!lock_task_sighand(task, &flags)) > + return -ESRCH; > + > + f = READ_ONCE(task->seccomp.filter); > + if (!f) { > + unlock_task_sighand(task, &flags); > + return 0; > + } > + > + /* prevent filter from being freed while we are printing it */ > + __get_seccomp_filter(f); > + unlock_task_sighand(task, &flags); > + > + proc_pid_seccomp_cache_arch(m, SECCOMP_ARCH_NATIVE_NAME, > + f->cache.allow_native, error: ‘struct action_cache’ has no member named ‘allow_native’ struct action_cache is empty if SECCOMP_ARCH_NATIVE is not defined (so there are checks for it). > + SECCOMP_ARCH_NATIVE_NR); > + > +#ifdef SECCOMP_ARCH_COMPAT > + proc_pid_seccomp_cache_arch(m, SECCOMP_ARCH_COMPAT_NAME, > + f->cache.allow_compat, > + SECCOMP_ARCH_COMPAT_NR); > +#endif /* SECCOMP_ARCH_COMPAT */ > + > + __put_seccomp_filter(f); > + return 0; > +} > +#endif /* CONFIG_SECCOMP_CACHE_DEBUG */ > -- > 2.28.0 > -- Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v5 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-12-17 12:14 ` Geert Uytterhoeven @ 2020-12-17 18:34 ` YiFei Zhu 2020-12-18 12:35 ` Geert Uytterhoeven 0 siblings, 1 reply; 149+ messages in thread From: YiFei Zhu @ 2020-12-17 18:34 UTC (permalink / raw) To: Geert Uytterhoeven Cc: Linux Containers, YiFei Zhu, bpf, Linux Kernel Mailing List, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry On Thu, Dec 17, 2020 at 6:14 AM Geert Uytterhoeven <geert@linux-m68k.org> wrote: > Should there be a dependency on SECCOMP_ARCH_NATIVE? > Should all architectures that implement seccomp have this? > > E.g. mips does select HAVE_ARCH_SECCOMP_FILTER, but doesn't > have SECCOMP_ARCH_NATIVE? > > (noticed with preliminary out-of-tree seccomp implementation for m68k, > which doesn't have SECCOMP_ARCH_NATIVE Hi Geert You are correct. This specific patch in this series was not applied, and this was addressed in a follow up patch series [1]. MIPS does not define SECCOMP_ARCH_NATIVE because the bitmap expects syscall numbers to start from 0, whereas MIPS does not (defines CONFIG_HAVE_SPARSE_SYSCALL_NR). The follow up patch makes it so that any arch with HAVE_SPARSE_SYSCALL_NR (currently just MIPS) cannot have CONFIG_SECCOMP_CACHE_DEBUG on, by the depend on clause. I see that you are doing an out of tree seccomp implementation for m68k. 
Assuming unchanged arch/xtensa/include/asm/syscall.h, something like this to arch/m68k/include/asm/seccomp.h should make it work: #define SECCOMP_ARCH_NATIVE AUDIT_ARCH_M68K #define SECCOMP_ARCH_NATIVE_NR NR_syscalls #define SECCOMP_ARCH_NATIVE_NAME "m68k" If the file does not exist already, arch/xtensa/include/asm/seccomp.h is a good example of how the file should look like, and remember to remove `generic-y += seccomp.h` from arch/m68k/include/asm/Kbuild. [1] https://lore.kernel.org/lkml/cover.1605101222.git.yifeifz2@illinois.edu/T/ YiFei Zhu ^ permalink raw reply [flat|nested] 149+ messages in thread
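Gathering YiFei's advice into the shape of the header he describes, a hypothetical arch/m68k/include/asm/seccomp.h might look like the fragment below. This file was never part of the series; it is an untested sketch assembled verbatim from the macro names in the message, with arch/xtensa/include/asm/seccomp.h as the in-tree model:

```c
/* arch/m68k/include/asm/seccomp.h -- hypothetical sketch, per the
 * advice above (cf. arch/xtensa/include/asm/seccomp.h). */
#ifndef _ASM_M68K_SECCOMP_H
#define _ASM_M68K_SECCOMP_H

#define SECCOMP_ARCH_NATIVE		AUDIT_ARCH_M68K
#define SECCOMP_ARCH_NATIVE_NR		NR_syscalls
#define SECCOMP_ARCH_NATIVE_NAME	"m68k"

#include <asm-generic/seccomp.h>

#endif /* _ASM_M68K_SECCOMP_H */
```

As the message notes, `generic-y += seccomp.h` would also have to be dropped from arch/m68k/include/asm/Kbuild so this file takes effect.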
* Re: [PATCH v5 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache 2020-12-17 18:34 ` YiFei Zhu @ 2020-12-18 12:35 ` Geert Uytterhoeven 0 siblings, 0 replies; 149+ messages in thread From: Geert Uytterhoeven @ 2020-12-18 12:35 UTC (permalink / raw) To: YiFei Zhu Cc: Linux Containers, YiFei Zhu, bpf, Linux Kernel Mailing List, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight, Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry Hi YiFei, On Thu, Dec 17, 2020 at 7:34 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > On Thu, Dec 17, 2020 at 6:14 AM Geert Uytterhoeven <geert@linux-m68k.org> wrote: > > Should there be a dependency on SECCOMP_ARCH_NATIVE? > > Should all architectures that implement seccomp have this? > > > > E.g. mips does select HAVE_ARCH_SECCOMP_FILTER, but doesn't > > have SECCOMP_ARCH_NATIVE? > > > > (noticed with preliminary out-of-tree seccomp implementation for m68k, > > which doesn't have SECCOMP_ARCH_NATIVE > > You are correct. This specific patch in this series was not applied, > and this was addressed in a follow up patch series [1]. MIPS does not > define SECCOMP_ARCH_NATIVE because the bitmap expects syscall numbers > to start from 0, whereas MIPS does not (defines > CONFIG_HAVE_SPARSE_SYSCALL_NR). The follow up patch makes it so that > any arch with HAVE_SPARSE_SYSCALL_NR (currently just MIPS) cannot have > CONFIG_SECCOMP_CACHE_DEBUG on, by the depend on clause. > > I see that you are doing an out of tree seccomp implementation for > m68k. 
Assuming unchanged arch/xtensa/include/asm/syscall.h, something > like this to arch/m68k/include/asm/seccomp.h should make it work: > > #define SECCOMP_ARCH_NATIVE AUDIT_ARCH_M68K > #define SECCOMP_ARCH_NATIVE_NR NR_syscalls > #define SECCOMP_ARCH_NATIVE_NAME "m68k" > > If the file does not exist already, arch/xtensa/include/asm/seccomp.h > is a good example of how the file should look like, and remember to > remove `generic-y += seccomp.h` from arch/m68k/include/asm/Kbuild. > > [1] https://lore.kernel.org/lkml/cover.1605101222.git.yifeifz2@illinois.edu/T/ Thank you for your extensive explanation. Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds ^ permalink raw reply [flat|nested] 149+ messages in thread
* Re: [PATCH v5 seccomp 0/5]seccomp: Add bitmap cache of constant allow filter results 2020-10-11 15:47 ` [PATCH v5 seccomp 0/5]seccomp: Add bitmap cache of constant allow filter results YiFei Zhu ` (4 preceding siblings ...) 2020-10-11 15:47 ` [PATCH v5 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu @ 2020-10-27 19:14 ` Kees Cook 5 siblings, 0 replies; 149+ messages in thread From: Kees Cook @ 2020-10-27 19:14 UTC (permalink / raw) To: YiFei Zhu, containers Cc: Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Jack Chen, YiFei Zhu, Valentin Rothberg, Andrea Arcangeli, Dimitrios Skarlatos, Andy Lutomirski, David Laight, bpf, Jann Horn, Giuseppe Scrivano, Josep Torrellas, Hubertus Franke, Will Drewry, linux-kernel, Tycho Andersen, Aleksa Sarai On Sun, 11 Oct 2020 10:47:41 -0500, YiFei Zhu wrote: > Alternative: https://lore.kernel.org/lkml/20200923232923.3142503-1-keescook@chromium.org/T/ > > Major differences from the linked alternative by Kees: > * No x32 special-case handling -- not worth the complexity > * No caching of denylist -- not worth the complexity > * No seccomp arch pinning -- I think this is an independent feature > * The bitmaps are part of the filters rather than the task. > > [...] Applied to for-next/seccomp, thanks! I left off patch 5 for now until we sort out the rest of the SECCOMP_FILTER architectures, and tweaked patch 3 to include the architecture names. [1/4] seccomp/cache: Lookup syscall allowlist bitmap for fast path https://git.kernel.org/kees/c/f94defb8bf46 [2/4] seccomp/cache: Add "emulator" to check if filter is constant allow https://git.kernel.org/kees/c/e7dc9f1e5f6b [3/4] x86: Enable seccomp architecture tracking https://git.kernel.org/kees/c/1f68a4d393fe [4/4] selftests/seccomp: Compare bitmap vs filter overhead https://git.kernel.org/kees/c/57a339117e52 -- Kees Cook ^ permalink raw reply [flat|nested] 149+ messages in thread